Abstract: A class of distance measures on probabilities, the
integral probability metrics (IPMs), is addressed: these
include the Wasserstein distance, Dudley metric, and Maximum
Mean Discrepancy. IPMs have thus far mostly been used in
more abstract settings, for instance as theoretical tools in mass
transportation problems, and in metrizing the weak topology
on the set of all Borel probability measures defined on a metric
space. Practical applications of IPMs are less common, with some
exceptions in the kernel machines literature. The present work
contributes a number of novel properties of IPMs, which should
help make IPMs more widely used in practice, for
instance in areas where $\phi$-divergences are currently popular.

First, to understand the relation between IPMs and
$\phi$-divergences, the necessary and sufficient conditions under which
these classes intersect are derived: the total variation distance
is shown to be the only non-trivial $\phi$-divergence that is also an
IPM. This shows that IPMs are essentially different from
$\phi$-divergences. Second, empirical estimates of several IPMs from
finite i.i.d. samples are obtained, and their consistency and
convergence rates are analyzed. These estimators are shown to
be easily computable, with better rates of convergence than
estimators of $\phi$-divergences. Third, a novel interpretation is
provided for IPMs by relating them to binary classification, where 
it is shown that the IPM between class-conditional distributions is 
the negative of the optimal risk associated with a binary classifier. 
In addition, the smoothness of an appropriate binary classifier 
is proved to be inversely related to the distance between the 
class-conditional distributions, measured in terms of an IPM. 

Index Terms: Integral probability metrics, $\phi$-divergences,
Wasserstein distance, Dudley metric, maximum mean discrepancy,
reproducing kernel Hilbert space, Rademacher average,
Lipschitz classifier, Parzen window classifier, support vector
machine.



I. Introduction 

THE notion of distance between probability measures has
found many applications in probability theory, mathematical
statistics and information theory [1]-[3]. Popular
applications include distribution testing, establishing central
limit theorems, density estimation, signal detection, channel
and source coding, etc.
One of the widely studied and well understood families of
distances/divergences between probability measures is the
Ali-Silvey distance [4], also called Csiszár's $\phi$-divergence [5],
which is defined as

\[ D_\phi(P, Q) := \begin{cases} \int_M \phi\left(\frac{dP}{dQ}\right) dQ, & P \ll Q \\ +\infty, & \text{otherwise}, \end{cases} \tag{1} \]

where $M$ is a measurable space and $\phi : [0,\infty) \to (-\infty,\infty]$
is a convex function.$^1$ $P \ll Q$ denotes that $P$ is absolutely
continuous w.r.t. $Q$. Well-known distance/divergence measures
obtained by appropriately choosing $\phi$ include the
Kullback-Leibler (KL) divergence ($\phi(t) = t \log t$), Hellinger distance
($\phi(t) = (\sqrt{t} - 1)^2$), total variation distance ($\phi(t) = |t - 1|$),
$\chi^2$-divergence ($\phi(t) = (t - 1)^2$), etc. See [2], [6] and
references therein for selected statistical and information theoretic
applications of $\phi$-divergences.
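
As a small illustration of the generators listed above, the following sketch evaluates $D_\phi$ on a finite space, where the integral reduces to $\sum_x q(x)\,\phi(p(x)/q(x))$. The function name `phi_divergence`, the example vectors `p` and `q`, and the convention that points with $p = q = 0$ contribute nothing are assumptions made for this example only.

```python
import math

def phi_divergence(p, q, phi):
    """D_phi(P, Q) = sum_x q(x) * phi(p(x) / q(x)) on a finite space.

    Returns +inf when P is not absolutely continuous w.r.t. Q.
    """
    total = 0.0
    for px, qx in zip(p, q):
        if qx == 0.0:
            if px > 0.0:
                return math.inf  # P puts mass where Q has none
            continue  # assumed convention: p = q = 0 contributes nothing
        total += qx * phi(px / qx)
    return total

# Generators quoted in the text (0 log 0 is taken to be 0).
def kl(t):
    return t * math.log(t) if t > 0.0 else 0.0

def hellinger(t):
    return (math.sqrt(t) - 1.0) ** 2

def total_variation(t):
    return abs(t - 1.0)

def chi_square(t):
    return (t - 1.0) ** 2

# Hypothetical example distributions on three points.
p = [0.5, 0.5, 0.0]
q = [0.25, 0.25, 0.5]

divergences = {
    "KL": phi_divergence(p, q, kl),
    "Hellinger": phi_divergence(p, q, hellinger),
    "TV": phi_divergence(p, q, total_variation),
    "chi2": phi_divergence(p, q, chi_square),
}
```

With $\phi(t) = |t - 1|$ the sum is $\sum_x |p(x) - q(x)|$, which matches the total variation distance in the normalization used here.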

In this paper, we consider another popular family (particularly
in probability theory and mathematical statistics) of
distance measures: the integral probability metrics (IPMs) [7],
defined as

\[ \gamma_{\mathcal{F}}(P, Q) := \sup_{f \in \mathcal{F}} \left| \int_M f \, dP - \int_M f \, dQ \right|, \tag{2} \]

where $\mathcal{F}$ in (2) is a class of real-valued bounded measurable
functions on $M$. So far, IPMs have been mainly studied as
tools of theoretical interest in probability theory [3], [8], [9,
Chapter 11], with limited applicability in practice. Therefore,
in this paper, we present a number of novel properties of IPMs,
which will serve to improve their usefulness in more applied
domains. We emphasize in particular the advantages of IPMs
compared to $\phi$-divergences.

$\phi$-divergences, and especially the KL-divergence, are better
known and more widely used in diverse fields such as
neuroscience [10]-[12] and distribution testing [13]-[16]; however,
they are notoriously tough to estimate, especially in high
dimensions, $d$, when $M = \mathbb{R}^d$, e.g., see [17]. By contrast,
we show that under certain conditions on $\mathcal{F}$, irrespective of the
dimension, $d$, IPMs are very simple to estimate in a consistent
manner. This property can be exploited in statistical
applications where the distance between $P$ and $Q$ is to be estimated
from finite data. Further, we show that IPMs are naturally
related to binary classification, which gives these distances a
clear and natural interpretation. Specifically, we show that (a)
the smoothness of a binary classifier is inversely related to the
distance between the class-conditional distributions, measured
in terms of IPM, and (b) the IPM between the class-conditional
distributions is the negative of the optimal risk associated with
an appropriate binary classifier. We will go into more detail
regarding these contributions in Section I-B. First, we provide
some examples of IPMs and their applications.

$^1$Usually, the condition $\phi(1) = 0$ is used in the definition of
$\phi$-divergence. Here, we do not enforce this condition.

A. Examples and Applications of IPMs 

The definition of IPMs in (2) is motivated from the notion
of weak convergence of probability measures on metric spaces
[9, Section 9.3, Lemma 9.3.2]. In probability theory, IPMs are
used in proving central limit theorems using Stein's method
[18], [19]. They are also the fundamental quantities that appear
in empirical process theory [20], where $Q$ is replaced by the
empirical distribution of $P$.

Various popular distance measures in probability theory
and statistics can be obtained by appropriately choosing $\mathcal{F}$.
Suppose $(M, \rho)$ is a metric space with $\mathcal{A}$ being the Borel
$\sigma$-algebra induced by the metric topology. Let $\mathcal{P}$ be the set of
all Borel probability measures on $\mathcal{A}$.

1) Dudley metric: Choose $\mathcal{F} = \{f : \|f\|_{BL} \le 1\}$ in
(2), where $\|f\|_{BL} := \|f\|_\infty + \|f\|_L$,
$\|f\|_\infty := \sup\{|f(x)| : x \in M\}$ and
$\|f\|_L := \sup\{|f(x) - f(y)|/\rho(x,y) : x \ne y \text{ in } M\}$.
$\|f\|_L$ is called the Lipschitz semi-norm of a real-valued
function $f$ on $M$ [21, Chapter 19, Definition 2.2]. The
Dudley metric is popularly used in the context of proving the
convergence of probability measures with respect to the weak
topology [9, Chapter 11].

2) Kantorovich metric and Wasserstein distance: Choosing
$\mathcal{F} = \{f : \|f\|_L \le 1\}$ in (2) yields the Kantorovich metric. The
famous Kantorovich-Rubinstein theorem [9, Theorem 11.8.2]
shows that when $M$ is separable, the Kantorovich metric is
the dual representation of the so-called Wasserstein distance
defined as

\[ W_1(P, Q) := \inf_{\mu \in \mathcal{L}(P, Q)} \int \rho(x, y) \, d\mu(x, y), \tag{3} \]

where $P, Q \in \mathcal{P}_1 := \{P : \int \rho(x,y) \, dP(x) < \infty, \ \forall y \in M\}$
and $\mathcal{L}(P, Q)$ is the set of all measures on $M \times M$ with
marginals $P$ and $Q$. Due to this duality, in this paper, we
refer to the Kantorovich metric as the Wasserstein distance
and denote it as $W$ when $M$ is separable. The Wasserstein
distance has found applications in information theory [22],
mathematical statistics [23], [24], and mass transportation
problems [8], and is also called the earth mover's distance in
engineering applications [25].

3) Total variation distance and Kolmogorov distance: $\gamma_{\mathcal{F}}$
is the total variation metric when $\mathcal{F} = \{f : \|f\|_\infty \le 1\}$
while it is the Kolmogorov distance when
$\mathcal{F} = \{\mathbb{1}_{(-\infty, t]} : t \in \mathbb{R}^d\}$. Note that the classical central limit theorem and the
Berry-Esseen theorem in $\mathbb{R}^d$ use the Kolmogorov distance.
The Kolmogorov distance also appears in hypothesis testing
as the Kolmogorov-Smirnov statistic [21].

4) Maximum mean discrepancy: $\gamma_{\mathcal{F}}$ is called the maximum
mean discrepancy (MMD) [26], [27] when
$\mathcal{F} = \{f : \|f\|_{\mathcal{H}} \le 1\}$. Here, $\mathcal{H}$ represents a reproducing kernel Hilbert space
(RKHS) [28], [29] with $k$ as its reproducing kernel (r.k.).$^2$
MMD is used in statistical applications including homogeneity
testing [26], independence testing [30], and testing for
conditional independence [31].

B. Contributions 

Some of the previously mentioned IPMs, e.g., the
Kantorovich distance and Dudley metric, are mainly tools of
theoretical interest in probability theory. Their
application in practice is generally less well established. The
Dudley metric has been used only in the context of metrizing
the weak topology on $\mathcal{P}$ [9, Chapter 11]. The Kantorovich
distance is more widespread, although it is better known in
its primal form in (3) as the Wasserstein distance than as an
IPM [3], [8]. The goal of this work is to present a number
of favourable statistical and implementational properties of
IPMs, and to specifically compare IPMs and $\phi$-divergences.
Our hope is to broaden the applicability of IPMs, and to
encourage their wider adoption in data analysis and statistics.
The contributions of this paper are three-fold, and explained
in detail below.

1) IPMs and $\phi$-divergences: Since $\phi$-divergences are well
studied and understood, the first question we are interested
in is whether IPMs have any relation to $\phi$-divergences. In
particular, we would like to know whether any of the IPMs
can be realized as a $\phi$-divergence, so that the properties of
$\phi$-divergences will carry over to those IPMs. In Section II, we
first show that $\gamma_{\mathcal{F}}$ is closely related to the variational form of
$D_\phi$ [32]-[34] and is "trivially" a $\phi$-divergence if $\mathcal{F}$ is chosen to
be the set of all real-valued measurable functions on $M$ (see
Theorem 1). Next, we generalize this result by determining
the necessary and sufficient conditions on $\mathcal{F}$ and $\phi$ for which
$\gamma_{\mathcal{F}}(P, Q) = D_\phi(P, Q)$, $\forall P, Q \in \mathcal{P}_0 \subset \mathcal{P}$, where $\mathcal{P}_0$ is some
subset of $\mathcal{P}$. This leads to our first contribution, answering
the question, "Given a set of distance/divergence measures,
$\{\gamma_{\mathcal{F}} : \mathcal{F}\}$ (indexed by $\mathcal{F}$) and $\{D_\phi : \phi\}$ (indexed by $\phi$) defined
on $\mathcal{P}$, is there a set of distance measures that is common to
both these families?" We show that the classes $\{\gamma_{\mathcal{F}} : \mathcal{F}\}$ and
$\{D_\phi : \phi\}$ of distance measures intersect non-trivially only at
the total variation distance, which in turn indicates that these
classes are essentially different and therefore the properties of
$\phi$-divergences will not carry over to IPMs.

2) Estimation of IPMs: Many statistical inference
applications such as distribution testing involve the estimation of
distance between probability measures $P$ and $Q$ based on finite
samples drawn i.i.d. from each. We first consider the properties
of finite sample estimates of the $\phi$-divergence, which is a
well-studied problem (especially for the KL-divergence; see [17],
[35] and references therein). Wang et al. [17] used a
data-dependent space partitioning scheme and showed that the
non-parametric estimator of KL-divergence is strongly consistent.
However, the rate of convergence of this estimator can be
arbitrarily slow depending on the distributions. In addition,
for increasing dimensionality of the data (in $\mathbb{R}^d$), the method
is increasingly difficult to implement. On the other hand,
by exploiting the variational representation of $\phi$-divergences,
Nguyen et al. [35] provide a consistent estimate of a lower
bound of the KL-divergence by solving a convex program.
Although this approach is efficient and the dimensionality of
the data is not an issue, the estimator provides a lower bound
and not the KL-divergence itself. Given the disadvantages
associated with the estimation of $\phi$-divergences, it is of interest
to compare with the convergence behaviour of finite sample
estimates of IPMs.

$^2$A function $k : M \times M \to \mathbb{R}$, $(x, y) \mapsto k(x, y)$ is a reproducing
kernel of the Hilbert space $\mathcal{H}$ if and only if the following hold: (i)
$\forall y \in M$, $k(\cdot, y) \in \mathcal{H}$ and (ii) $\forall y \in M$, $\forall f \in \mathcal{H}$, $\langle f, k(\cdot, y) \rangle_{\mathcal{H}} = f(y)$.
$\mathcal{H}$ is called a reproducing kernel Hilbert space.

To this end, as our second and "main" contribution, in
Section III, we consider the non-parametric estimation of some
IPMs, in particular the Wasserstein distance, Dudley metric
and MMD based on finite samples drawn i.i.d. from $P$ and $Q$.
The estimates of the Wasserstein distance and Dudley metric
are obtained by solving linear programs while an estimator of
MMD is computed in closed form (see Section III-A). One
of the advantages of these estimators is that they are quite
simple to implement and are not affected by the dimensionality
of the data, unlike $\phi$-divergences. Next, in Section III-B, we
show that these estimators are strongly consistent and provide
their rates of convergence, using concentration inequalities and
tools from empirical process theory [20]. In Section III-C,
we describe simulation results that demonstrate the practical
viability of these estimators. The results show that it is simpler
and more efficient to use IPMs instead of $\phi$-divergences in
many statistical inference applications.

Since the total variation distance is also an IPM, in
Section III-D, we discuss its empirical estimation and show that
the empirical estimator is not strongly consistent. Because
of this, we provide new lower bounds for the total variation
distance in terms of the Wasserstein distance, Dudley metric,
and MMD, which can be consistently estimated. These bounds
also translate as lower bounds on the KL-divergence through
Pinsker's inequality [36].

Our study shows that estimating IPMs (especially the
Wasserstein distance, Dudley metric and MMD) is much
simpler than estimating $\phi$-divergences, and that the estimators
are strongly consistent while exhibiting good rates of
convergence. In addition, IPMs also account for the properties of
the underlying space $M$ (the metric property is determined by
$\rho$ in the case of Wasserstein and Dudley metrics, while the
similarity property is determined by the kernel $k$ [37] in the
case of MMD) while computing the distance between $P$ and
$Q$, which is not the case with $\phi$-divergences. This property
is useful when $P$ and $Q$ have disjoint support.$^3$ With these
advantages, we believe that IPMs can find many applications
in information theory, image processing, machine learning,
neuroscience and other areas.

3) Interpretability of IPMs: Relation to Binary
Classification: Finally, as our third contribution, we
provide a nice interpretation for IPMs by showing they
naturally appear in binary classification. Many previous
works [6], [33], [38], [39] relate $\phi$-divergences (between
$P$ and $Q$) to binary classification (where $P$ and $Q$ are the
class conditional distributions) as the negative of the optimal
risk associated with a loss function (see [40, Section 1.3]
for a detailed list of references). In Section IV, we present
a series of results that relate IPMs to binary classification.
First, in Section IV-A, we provide a result (similar to that
for $\phi$-divergences), which shows $\gamma_{\mathcal{F}}(P, Q)$ is the negative
of the optimal risk associated with a binary classifier that
separates the class conditional distributions, $P$ and $Q$, where
the classification rule is restricted to $\mathcal{F}$. Therefore, the Dudley
metric, Wasserstein distance, total variation distance and
MMD can be understood as the negative of the optimal
risk associated with a classifier for which the classification
rule is restricted to $\{f : \|f\|_{BL} \le 1\}$, $\{f : \|f\|_L \le 1\}$,
$\{f : \|f\|_\infty \le 1\}$ and $\{f : \|f\|_{\mathcal{H}} \le 1\}$ respectively. Next,
in Sections IV-B and IV-C, we present a second result that
relates the empirical estimators studied in Section III to
the binary classification setting, by relating the empirical
estimators of the Wasserstein distance and Dudley metric
to the margins of the Lipschitz [41] and bounded Lipschitz
classifiers, respectively; and MMD to the Parzen window
classifier [37], [42] (see kernel classification rule [43, Chapter
10]). The significance of this result is that the smoothness of
the classifier is inversely related to the empirical estimate of
the IPM between class conditionals $P$ and $Q$. Although this is
intuitively clear, our result provides a theoretical justification.

$^3$When $P$ and $Q$ have disjoint support, $D_\phi(P, Q) = +\infty$ irrespective of the
properties of $M$, while $\gamma_{\mathcal{F}}(P, Q)$ varies with the properties of $M$. Therefore,
in such cases, $\gamma_{\mathcal{F}}(P, Q)$ provides a better notion of distance between $P$ and
$Q$.

Before proceeding with our main presentation, we introduce
the notation we will use throughout the paper. Certain proofs
and supplementary results are presented in a collection of
appendices, and referenced as needed.

C. Notation 

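For a measurable function $f$ and a signed measure $P$,
$Pf := \int f \, dP$ denotes the expectation of $f$ under $P$. $\mathbb{1}\{A\}$
represents the indicator function for set $A$. Given an i.i.d.
sample $X_1, \ldots, X_n$ drawn from $P$, $P_n := \frac{1}{n} \sum_{i=1}^n \delta_{X_i}$
represents the empirical distribution, where $\delta_x$ represents the Dirac
measure at $x$. We use $P_n f$ to represent the empirical
expectation $\frac{1}{n} \sum_{i=1}^n f(X_i)$. We define:
$\mathrm{Lip}(M, \rho) := \{f : M \to \mathbb{R} \mid \|f\|_L < \infty\}$,
$BL(M, \rho) := \{f : M \to \mathbb{R} \mid \|f\|_{BL} < \infty\}$,
and

\[ \begin{aligned}
\mathcal{F}_W &:= \{f : \|f\|_L \le 1\}, & W(P, Q) &:= \gamma_{\mathcal{F}_W}(P, Q), \\
\mathcal{F}_\beta &:= \{f : \|f\|_{BL} \le 1\}, & \beta(P, Q) &:= \gamma_{\mathcal{F}_\beta}(P, Q), \\
\mathcal{F}_k &:= \{f : \|f\|_{\mathcal{H}} \le 1\}, & \gamma_k(P, Q) &:= \gamma_{\mathcal{F}_k}(P, Q), \\
\mathcal{F}_{TV} &:= \{f : \|f\|_\infty \le 1\}, & TV(P, Q) &:= \gamma_{\mathcal{F}_{TV}}(P, Q).
\end{aligned} \]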



II. IPMs AND $\phi$-DIVERGENCES

In this section, we consider $\{\gamma_{\mathcal{F}} : \mathcal{F}\}$ and $\{D_\phi : \phi\}$, which
are classes of IPMs and $\phi$-divergences on $\mathcal{P}$ indexed by $\mathcal{F}$
and $\phi$, respectively. We derive conditions on $\mathcal{F}$ and $\phi$ such that
$\forall P, Q \in \mathcal{P}_0 \subset \mathcal{P}$, $\gamma_{\mathcal{F}}(P, Q) = D_\phi(P, Q)$ for some chosen
$\mathcal{P}_0$. This shows the degree of overlap between the class of
IPMs and the class of $\phi$-divergences.
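Consider the variational form of $D_\phi$ [32], [34], [35] given
by

\[ D_\phi(P, Q) = \sup_{f : M \to \mathbb{R}} \left[ \int_M f \, dP - \int_M \phi^*(f) \, dQ \right] = \sup_{f : M \to \mathbb{R}} \left( Pf - Q\phi^*(f) \right), \tag{4} \]

where $\phi^*(t) = \sup\{tu - \phi(u) : u \in \mathbb{R}\}$ is the convex conjugate
of $\phi$. Suppose $\mathcal{F}$ is such that $f \in \mathcal{F} \Rightarrow -f \in \mathcal{F}$. Then,

\[ \gamma_{\mathcal{F}}(P, Q) = \sup_{f \in \mathcal{F}} |Pf - Qf| = \sup_{f \in \mathcal{F}} (Pf - Qf). \tag{5} \]

Recently, Reid and Williamson [40, Section 8.2] considered
the generalization of $D_\phi$ by modifying its variational form as

\[ D_{\phi, \mathcal{F}}(P, Q) := \sup_{f \in \mathcal{F}} \left( Pf - Q\phi^*(f) \right). \tag{6} \]

Let $\mathcal{F}_*$ be the set of all real-valued measurable functions on
$M$ and let $\phi_*$ be the convex function defined as in (7). It is
easy to show that $\phi_*^*(u) = u$. Comparing $\gamma_{\mathcal{F}}$ in (5) to
$D_\phi$ in (4) through $D_{\phi, \mathcal{F}}$ in (6), we see that $\gamma_{\mathcal{F}} = D_{\phi_*, \mathcal{F}}$ and
$D_\phi = D_{\phi, \mathcal{F}_*}$. This means $\gamma_{\mathcal{F}}$ is obtained by fixing $\phi$ to $\phi_*$
in $D_{\phi, \mathcal{F}}$ with $\mathcal{F}$ as the variable and $D_\phi$ is obtained by fixing
$\mathcal{F}$ to $\mathcal{F}_*$ in $D_{\phi, \mathcal{F}}$ with $\phi$ as the variable. This provides a nice
relation between $\gamma_{\mathcal{F}}$ and $D_\phi$, leading to the following simple
result which shows that $\gamma_{\mathcal{F}_*}$ is "trivially" a $\phi$-divergence.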

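Theorem 1 ($\gamma_{\mathcal{F}_*}$ is a $\phi$-divergence): Let $\mathcal{F}_*$ be the set of
all real-valued measurable functions on $M$ and let

\[ \phi_*(t) = 0 \cdot \mathbb{1}\{t = 1\} + \infty \cdot \mathbb{1}\{t \ne 1\}. \tag{7} \]

Then

\[ \gamma_{\mathcal{F}_*}(P, Q) = D_{\phi_*}(P, Q) = 0 \cdot \mathbb{1}\{P = Q\} + \infty \cdot \mathbb{1}\{P \ne Q\}. \tag{8} \]

Conversely, $\gamma_{\mathcal{F}}(P, Q) = D_\phi(P, Q) = 0 \cdot \mathbb{1}\{P = Q\} + \infty \cdot \mathbb{1}\{P \ne Q\}$
implies $\mathcal{F} = \mathcal{F}_*$ and $\phi = \phi_*$.

Proof: (8) simply follows by using $\mathcal{F}_*$ and $\phi_*$ in $\gamma_{\mathcal{F}}$
and $D_\phi$, or by using $\phi_*^*(u) = u$ in (4). For the converse,
note that $D_\phi(P, Q) = 0 \cdot \mathbb{1}\{P = Q\} + \infty \cdot \mathbb{1}\{P \ne Q\}$ implies
$\phi(1) = 0$ and $\int \phi(dP/dQ) \, dQ = \infty$, $\forall P \ne Q$, which means
$\phi(x) = \infty$, $\forall x \ne 1$ and so $\phi = \phi_*$. Consider $\gamma_{\mathcal{F}}(P, Q) =
\gamma_{\mathcal{F}_*}(P, Q) = \sup\{Pf - Qf : f \in \mathcal{F}_*\}$, $\forall P, Q \in \mathcal{P}$. Suppose
$\mathcal{F} \subsetneq \mathcal{F}_*$. Then it is easy to see that $\gamma_{\mathcal{F}}(P, Q) < \gamma_{\mathcal{F}_*}(P, Q)$ for
some $P, Q \in \mathcal{P}$, which leads to a contradiction. Therefore,
$\mathcal{F} = \mathcal{F}_*$. ■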

From (8), it is clear that $\gamma_{\mathcal{F}_*}(P, Q)$ is the strongest way to
measure the distance between probability measures, and is not
a very useful metric in practice.$^4$ We therefore consider a more
restricted function class than $\mathcal{F}_*$ resulting in a variety of more
interesting IPMs, including the Dudley metric, Wasserstein
metric, total variation distance, etc. Now, the question is for
what other, more restricted function classes $\mathcal{F}$ does there
exist a $\phi$ such that $\gamma_{\mathcal{F}}$ is a $\phi$-divergence? We answer this
in the following theorem, where we show that the total
variation distance is the only "non-trivial" IPM that is also a
$\phi$-divergence. We first introduce some notation. Let us define
$\mathcal{P}_\lambda$ as the set of all probability measures, $P$, that are absolutely
continuous with respect to some $\sigma$-finite measure, $\lambda$. For
$P \in \mathcal{P}_\lambda$, let $p = \frac{dP}{d\lambda}$ be the Radon-Nikodym derivative of
$P$ with respect to $\lambda$. Let $\Phi$ be the class of all convex functions
$\phi : [0, \infty) \to (-\infty, \infty]$ continuous at $0$ and finite on $(0, \infty)$.

$^4$Unless $P$ and $Q$ are exactly the same, $\gamma_{\mathcal{F}_*}(P, Q) = +\infty$ and therefore
is a trivial and useless metric in practice.

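Theorem 2 (Necessary and sufficient conditions): Let $\mathcal{F} \subset
\mathcal{F}_*$ and $\phi \in \Phi$. Then for any $P, Q \in \mathcal{P}_\lambda$, $\gamma_{\mathcal{F}}(P, Q) =
D_\phi(P, Q)$ if and only if any one of the following hold:

(i) $\mathcal{F} = \{f : \|f\|_\infty \le \frac{\beta - \alpha}{2}\}$,
$\phi(u) = \alpha(u - 1)\mathbb{1}\{0 \le u \le 1\} + \beta(u - 1)\mathbb{1}\{u \ge 1\}$ for
some $\alpha < \beta < \infty$.

(ii) $\mathcal{F} = \{f : f = c, \ c \in \mathbb{R}\}$,
$\phi(u) = \alpha(u - 1)\mathbb{1}\{u \ge 0\}$, $\alpha \in \mathbb{R}$.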

The proof idea is as follows. First note that $\gamma_{\mathcal{F}}$ in (2) is a
pseudometric$^5$ on $\mathcal{P}_\lambda$ for any $\mathcal{F}$. Since we want to prove $\gamma_{\mathcal{F}} =
D_\phi$, this suggests that we first study the conditions on $\phi$ for
which $D_\phi$ is a pseudometric. This is answered by Lemma 3,
which is a simple modification of a result in [44, Theorem 2].

Lemma 3: For $\phi \in \Phi$, $D_\phi$ is a pseudometric on $\mathcal{P}_\lambda$ if and
only if $\phi$ is of the form

\[ \phi(u) = \alpha(u - 1)\mathbb{1}\{0 \le u \le 1\} + \beta(u - 1)\mathbb{1}\{u \ge 1\}, \tag{9} \]

for some $\beta \ge \alpha$.

Proof: See Appendix A. ■

The proof of Lemma 3 uses the following result from [44],
which is quite easy to prove.

Lemma 4 ([44]): For $\phi$ in (9),

\[ D_\phi(P, Q) = \frac{\beta - \alpha}{2} \int_M |p - q| \, d\lambda, \tag{10} \]

for any $P, Q \in \mathcal{P}_\lambda$, where $p$ and $q$ are the Radon-Nikodym
derivatives of $P$ and $Q$ with respect to $\lambda$.

Lemma 4 shows that $D_\phi(P, Q)$ in (10) associated with $\phi$ in
(9) is proportional to the total variation distance between $P$
and $Q$. Note that the total variation distance between $P$ and
$Q$ can be written as $\int_M |p - q| \, d\lambda$, where $p$ and $q$ are defined
as in Lemma 4.
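
The identity in Lemma 4 can be checked numerically on a finite space, taking $\lambda$ to be the counting measure so that the integrals become sums. The sketch below is only a sanity check under that assumption; the names `phi_ab`, `d_phi`, `l1_distance` and the example densities are hypothetical, and `q` is assumed strictly positive.

```python
def phi_ab(u, alpha, beta):
    """Piecewise-linear generator: alpha*(u-1) on [0, 1], beta*(u-1) for u >= 1."""
    return alpha * (u - 1.0) if u <= 1.0 else beta * (u - 1.0)

def d_phi(p, q, alpha, beta):
    """D_phi(P, Q) = sum_x q(x) * phi(p(x)/q(x)); assumes q(x) > 0 for all x."""
    return sum(qx * phi_ab(px / qx, alpha, beta) for px, qx in zip(p, q))

def l1_distance(p, q):
    """Sum of |p - q|, the total variation distance in the paper's normalization."""
    return sum(abs(px - qx) for px, qx in zip(p, q))

# Hypothetical densities w.r.t. the counting measure on four points.
p = [0.1, 0.2, 0.3, 0.4]
q = [0.25, 0.25, 0.25, 0.25]
alpha, beta = -1.0, 3.0

lhs = d_phi(p, q, alpha, beta)                    # left-hand side of Lemma 4
rhs = 0.5 * (beta - alpha) * l1_distance(p, q)    # right-hand side of Lemma 4
degenerate = d_phi(p, q, 2.0, 2.0)                # alpha = beta: divergence vanishes
```

The degenerate case $\alpha = \beta$ gives a divergence that is identically zero, which is the trivial case (ii) of Theorem 2.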

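Proof of Theorem 2: ($\Leftarrow$) Suppose (i) holds. Then for
any $P, Q \in \mathcal{P}_\lambda$, we have

\[ \begin{aligned}
\gamma_{\mathcal{F}}(P, Q) &= \sup\left\{ |Pf - Qf| : \|f\|_\infty \le \frac{\beta - \alpha}{2} \right\} \\
&= \frac{\beta - \alpha}{2} \sup\{ |Pf - Qf| : \|f\|_\infty \le 1 \} \\
&= \frac{\beta - \alpha}{2} \int_M |p - q| \, d\lambda \overset{(a)}{=} D_\phi(P, Q),
\end{aligned} \]

where (a) follows from Lemma 4.

$^5$Given a set $M$, a metric for $M$ is a function $\rho : M \times M \to \mathbb{R}_+$ such that
(i) $\forall x$, $\rho(x, x) = 0$, (ii) $\forall x, y$, $\rho(x, y) = \rho(y, x)$, (iii) $\forall x, y, z$, $\rho(x, z) \le
\rho(x, y) + \rho(y, z)$, and (iv) $\rho(x, y) = 0 \Rightarrow x = y$. A pseudometric only
satisfies (i)-(iii) of the properties of a metric. Unlike a metric space $(M, \rho)$,
points in a pseudometric space need not be distinguishable: one may have
$\rho(x, y) = 0$ for $x \ne y$.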
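Suppose (ii) holds. Then $\gamma_{\mathcal{F}}(P, Q) = 0$ and $D_\phi(P, Q) =
\int_M q \, \phi(p/q) \, d\lambda = \alpha \int_M (p - q) \, d\lambda = 0$.

($\Rightarrow$) Suppose $\gamma_{\mathcal{F}}(P, Q) = D_\phi(P, Q)$ for any $P, Q \in \mathcal{P}_\lambda$.
Since $\gamma_{\mathcal{F}}$ is a pseudometric on $\mathcal{P}_\lambda$ (irrespective of $\mathcal{F}$), $D_\phi$
is a pseudometric on $\mathcal{P}_\lambda$. Therefore, by Lemma 3, $\phi(u) =
\alpha(u - 1)\mathbb{1}\{0 \le u \le 1\} + \beta(u - 1)\mathbb{1}\{u \ge 1\}$ for some $\beta \ge \alpha$.
Now, let us consider two cases.

Case 1: $\beta > \alpha$

By Lemma 4, $D_\phi(P, Q) = \frac{\beta - \alpha}{2} \int_M |p - q| \, d\lambda$. Since
$\gamma_{\mathcal{F}}(P, Q) = D_\phi(P, Q)$ for all $P, Q \in \mathcal{P}_\lambda$, we have
$\gamma_{\mathcal{F}}(P, Q) = \frac{\beta - \alpha}{2} \int_M |p - q| \, d\lambda = \frac{\beta - \alpha}{2} \sup\{|Pf - Qf| :
\|f\|_\infty \le 1\} = \sup\{|Pf - Qf| : \|f\|_\infty \le \frac{\beta - \alpha}{2}\}$ and therefore
$\mathcal{F} = \{f : \|f\|_\infty \le \frac{\beta - \alpha}{2}\}$.

Case 2: $\beta = \alpha$

$\phi(u) = \alpha(u - 1)$, $u \ge 0$, $\alpha < \infty$. Now, $D_\phi(P, Q) =
\int_M q \, \phi(p/q) \, d\lambda = \alpha \int_M (p - q) \, d\lambda = 0$ for all $P, Q \in \mathcal{P}_\lambda$. Therefore,
$\gamma_{\mathcal{F}}(P, Q) = \sup_{f \in \mathcal{F}} |Pf - Qf| = 0$ for all $P, Q \in \mathcal{P}_\lambda$,
which means $\forall P, Q \in \mathcal{P}_\lambda$, $\forall f \in \mathcal{F}$, $Pf = Qf$. This, in turn,
means $f$ is a constant on $M$, i.e., $\mathcal{F} = \{f : f = c, \ c \in \mathbb{R}\}$. ■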

Note that in Theorem 2, the cases (i) and (ii) are disjoint
as $\alpha < \beta$ in case (i) and $\alpha = \beta$ in case (ii). Case (i)
shows that the family of $\phi$-divergences and the family of
IPMs intersect only at the total variation distance, which
follows from Lemma 4. Case (ii) is trivial as the distance
between any two probability measures is zero. This result
shows that IPMs and $\phi$-divergences are essentially different.
Theorem 2 also addresses the open question posed by Reid and
Williamson [40, pp. 56] of "whether there exist $\mathcal{F}$ such that $\gamma_{\mathcal{F}}$
is not a metric but equals $D_\phi$ for some $\phi \ne t \mapsto |t - 1|$?" This
is answered affirmatively by case (ii) in Theorem 2, as $\gamma_{\mathcal{F}}$ with
$\mathcal{F} = \{f : f = c, \ c \in \mathbb{R}\}$ is a pseudometric (not a metric) on
$\mathcal{P}_\lambda$ but equals $D_\phi$ for $\phi(u) = \alpha(u - 1)\mathbb{1}\{u \ge 0\} \ne u \mapsto |u - 1|$.

III. Non-parametric Estimation of IPMs

As mentioned in Section I-B2, the estimation of distance
between $P$ and $Q$ is an important problem in statistical
inference applications like distribution testing, where $P$ and
$Q$ are known only through random i.i.d. samples. Another
instance where an estimate of the distance between $P$ and $Q$
is useful is as follows. Suppose one wishes to compute the
Wasserstein distance or Dudley metric between $P$ and $Q$. This
is not straightforward as the explicit calculation, i.e., in closed
form, is difficult for most concrete examples.$^6$ Similar is the
case with MMD and $\phi$-divergences for certain distributions,
where one approach to compute the distance between $P$
and $Q$ is to draw random i.i.d. samples from each, and estimate
the distance based on these samples. We need the estimator to
be such that the estimate converges to the true distance with
large sample sizes.

$^6$The explicit form for the Wasserstein distance in (3) is known for
$(M, \rho(x, y)) = (\mathbb{R}, |x - y|)$ [2], [45], which is given as $W_1(P, Q) =
\int_{(0,1)} |F_P^{-1}(u) - F_Q^{-1}(u)| \, du = \int_{\mathbb{R}} |F_P(x) - F_Q(x)| \, dx$, where $F_P(x) =
P((-\infty, x])$ and $F_P^{-1}(u) = \inf\{x \in \mathbb{R} \mid F_P(x) \ge u\}$, $0 < u < 1$. It is
easy to show that this explicit form can be extended to $(\mathbb{R}^d, \|\cdot\|_1)$. However,
the exact computation (in closed form) of $W_1(P, Q)$ is not straightforward
for all $P$ and $Q$. See Section III-C for some examples where $W_1(P, Q)$ can
be computed exactly. Note that since $\mathbb{R}^d$ is separable, by the
Kantorovich-Rubinstein theorem, $W(P, Q) = W_1(P, Q)$, $\forall P, Q$.
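
For $M = \mathbb{R}$ with $\rho(x, y) = |x - y|$, the closed form $\int_{\mathbb{R}} |F_P(x) - F_Q(x)| \, dx$ quoted in the footnote on the explicit form of the Wasserstein distance applies directly to two empirical distributions, whose distribution functions are step functions. The following is a minimal sketch of that special case only; the function name `wasserstein_1d` is an assumption, and it does not cover general metric spaces.

```python
from bisect import bisect_right

def wasserstein_1d(xs, ys):
    """W_1 between the empirical distributions of xs and ys on the real line.

    Integrates |F_P(x) - F_Q(x)| exactly, using the fact that both empirical
    CDFs are constant between consecutive points of the pooled sample.
    """
    xs, ys = sorted(xs), sorted(ys)
    grid = sorted(set(xs) | set(ys))
    total = 0.0
    for left, right in zip(grid, grid[1:]):
        f_p = bisect_right(xs, left) / len(xs)  # empirical CDF of P on [left, right)
        f_q = bisect_right(ys, left) / len(ys)  # empirical CDF of Q on [left, right)
        total += abs(f_p - f_q) * (right - left)
    return total

# Hypothetical samples: the second is the first shifted by one unit.
shifted = wasserstein_1d([0.0, 2.0], [1.0, 3.0])
unequal_sizes = wasserstein_1d([0.0, 1.0], [0.5])
```

Shifting a sample by a constant moves every unit of mass by that constant, so the first example evaluates to the size of the shift.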

To this end, the non-parametric estimation of $\phi$-divergences,
especially the KL-divergence, is well studied (see [17], [35],
[46] and references therein). As mentioned before, the
drawback with $\phi$-divergences is that they are difficult to estimate
in high dimensions and the rate of convergence of the
estimator can be arbitrarily slow depending on the distributions
[17]. Since IPMs and $\phi$-divergences are essentially different
classes of distance measures on $\mathcal{P}$, in Section III-A, we
consider the non-parametric estimation of IPMs, especially
the Wasserstein distance, Dudley metric and MMD. We show
that the Wasserstein and Dudley metrics can be estimated by
solving linear programs (see Theorems 5 and 6) whereas an
estimator for MMD can be obtained in closed form ([26]; see
Theorem 7 below). These results are significant because to
our knowledge, statistical applications (e.g. hypothesis tests)
involving the Wasserstein distance in (3) are restricted only
to $\mathbb{R}$ [47] as the closed form expression for the Wasserstein
distance is known only for $\mathbb{R}$ (see footnote 6).

In Section III-B, we present the consistency and
convergence rate analysis of these estimators. To this end, in
Theorem 8, we present a general result on the statistical consistency
of the estimators of IPMs by using tools from empirical
process theory [20]. As a special case, in Corollary 9, we
show that the estimators of Wasserstein distance and Dudley
metric are strongly consistent (a sequence of estimators
$\{\theta_l\}$ of $\theta$ is strongly consistent if $\theta_l$ converges
a.s. to $\theta$ as $l \to \infty$). Then, in Theorem 11, we provide
a probabilistic bound on the deviation between $\gamma_{\mathcal{F}}$ and its
estimate for any $\mathcal{F}$ in terms of the Rademacher complexity
(see Definition 10), which is then used to derive the rates of
convergence for the estimators of Wasserstein distance, Dudley
metric and MMD in Corollary 12. Using the Borel-Cantelli
lemma, we then show that the estimator of MMD is also
strongly consistent. In Section III-C, we present simulation
results to demonstrate the performance of these estimators.
Overall, the results in this section show that IPMs (especially
the Wasserstein distance, Dudley metric and MMD) are easier
to estimate than the KL-divergence and the IPM estimators
exhibit better convergence behavior [17], [35].

Since the total variation distance is also an IPM, we discuss
its empirical estimation and consistency in Section III-D. By
citing earlier work [48], we show that the empirical estimator
of the total variation distance is not consistent. Since the
total variation distance cannot be estimated consistently, in
that section we also provide two lower bounds on the total
variation distance, one involving the Wasserstein distance
and Dudley metric and the other involving MMD. These
bounds can be estimated consistently based on the results in
Section III-B and, moreover, they translate to lower bounds
on the KL-divergence through Pinsker's inequality (see [36]
and references therein for more lower bounds on the
KL-divergence in terms of the total variation distance).
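A. Non-parametric estimation of Wasserstein distance, Dudley
metric and MMD

Let $\{X_1^{(1)}, X_2^{(1)}, \ldots, X_m^{(1)}\}$ and $\{X_1^{(2)}, X_2^{(2)}, \ldots, X_n^{(2)}\}$ be
i.i.d. samples drawn randomly from $P$ and $Q$ respectively. The
empirical estimate of $\gamma_{\mathcal{F}}(P, Q)$ is given by

\[ \gamma_{\mathcal{F}}(P_m, Q_n) = \sup_{f \in \mathcal{F}} \left| \sum_{i=1}^N \tilde{Y}_i f(X_i) \right|, \tag{11} \]

where $P_m$ and $Q_n$ represent the empirical distributions of
$P$ and $Q$, $N = m + n$ and

\[ \tilde{Y}_i = \begin{cases} \frac{1}{m}, & X_i = X_{\cdot}^{(1)} \\ -\frac{1}{n}, & X_i = X_{\cdot}^{(2)}. \end{cases} \tag{12} \]

The computation of $\gamma_{\mathcal{F}}(P_m, Q_n)$ in (11) is not straightforward
for any arbitrary $\mathcal{F}$. In the following, we restrict ourselves to
$\mathcal{F}_W := \{f : \|f\|_L \le 1\}$, $\mathcal{F}_\beta := \{f : \|f\|_{BL} \le 1\}$ and
$\mathcal{F}_k := \{f : \|f\|_{\mathcal{H}} \le 1\}$ and compute (11). Let us denote
$W := \gamma_{\mathcal{F}_W}$, $\beta := \gamma_{\mathcal{F}_\beta}$ and $\gamma_k := \gamma_{\mathcal{F}_k}$.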

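Theorem 5 (Estimator of Wasserstein distance): For all
$\alpha \in [0, 1]$, the following function solves (11) for $\mathcal{F} = \mathcal{F}_W$:

\[ f_\alpha(x) := \alpha \min_{i=1,\ldots,N} \left( a_i^* + \rho(x, X_i) \right) + (1 - \alpha) \max_{i=1,\ldots,N} \left( a_i^* - \rho(x, X_i) \right), \tag{13} \]

where

\[ W(P_m, Q_n) = \sum_{i=1}^N \tilde{Y}_i a_i^*, \tag{14} \]

and $\{a_i^*\}_{i=1}^N$ solve the following linear program,

\[ \max_{a_1, \ldots, a_N} \sum_{i=1}^N \tilde{Y}_i a_i \quad \text{s.t.} \quad -\rho(X_i, X_j) \le a_i - a_j \le \rho(X_i, X_j), \ \forall i, j. \tag{15} \]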



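Proof: Consider $W(P_m, Q_n) = \sup\{\sum_{i=1}^N \tilde{Y}_i f(X_i) :
\|f\|_L \le 1\}$. Note that

\[ 1 \ge \|f\|_L = \sup_{x \ne x'} \frac{|f(x) - f(x')|}{\rho(x, x')} \ge \max_{X_i \ne X_j} \frac{|f(X_i) - f(X_j)|}{\rho(X_i, X_j)}, \]

which means

\[ W(P_m, Q_n) \le \sup \sum_{i=1}^N \tilde{Y}_i f(X_i) \quad \text{s.t.} \quad \max_{X_i \ne X_j} \frac{|f(X_i) - f(X_j)|}{\rho(X_i, X_j)} \le 1. \tag{16} \]

The right hand side of (16) can be equivalently written as

\[ \sup \sum_{i=1}^N \tilde{Y}_i f(X_i) \quad \text{s.t.} \quad -\rho(X_i, X_j) \le f(X_i) - f(X_j) \le \rho(X_i, X_j), \ \forall i, j. \]

Let $a_i := f(X_i)$. Therefore, we have $W(P_m, Q_n) \le
\sum_{i=1}^N \tilde{Y}_i a_i^*$, where $\{a_i^*\}_{i=1}^N$ solve the linear program in (15).
Note that the objective in (15) is linear in $\{a_i\}_{i=1}^N$ with linear
inequality constraints and therefore by Theorem 24 (see
Appendix E), the optimum lies on the boundary of the constraint
set, which means $\max_{X_i \ne X_j} \frac{|a_i^* - a_j^*|}{\rho(X_i, X_j)} = 1$. Therefore, by
Lemma 19 (see Appendix E), $f$ on $\{X_1, \ldots, X_N\}$ can be
extended to a function $f_\alpha$ (on $M$) defined in (13) where
$f_\alpha(X_i) = f(X_i) = a_i^*$ and $\|f_\alpha\|_L = \|f\|_L = 1$, which means
$f_\alpha$ is a maximizer of (11) and $W(P_m, Q_n) = \sum_{i=1}^N \tilde{Y}_i a_i^*$. ■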
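
The extension step used at the end of this proof can be illustrated in isolation. The sketch below builds $f_\alpha$ from (13) for given values $a_i$ and checks numerically that it interpolates them and is 1-Lipschitz. It does not solve the linear program (15): the values in `values` are hypothetical numbers chosen by hand to satisfy the constraints $|a_i - a_j| \le \rho(X_i, X_j)$, and the helper name `lipschitz_extension` is likewise an assumption.

```python
def lipschitz_extension(points, values, rho, alpha=0.5):
    """Return f_alpha from (13): a convex combination of the largest and
    smallest 1-Lipschitz functions agreeing with `values` on `points`."""
    def f_alpha(x):
        upper = min(a + rho(x, xi) for xi, a in zip(points, values))
        lower = max(a - rho(x, xi) for xi, a in zip(points, values))
        return alpha * upper + (1.0 - alpha) * lower
    return f_alpha

def rho(x, y):
    """Metric on the real line, used only for this example."""
    return abs(x - y)

points = [0.0, 1.0, 3.0]
values = [0.0, 1.0, 0.5]  # hypothetical a_i with |a_i - a_j| <= rho(X_i, X_j)

grid = [-2.0 + 0.25 * step for step in range(29)]  # test points in [-2, 5]
interpolates = True
is_lipschitz = True
for alpha in (0.0, 0.5, 1.0):
    f = lipschitz_extension(points, values, rho, alpha)
    # f_alpha must reproduce a_i at the sample points ...
    interpolates &= all(abs(f(x) - a) < 1e-12 for x, a in zip(points, values))
    # ... and must not change faster than the metric allows.
    is_lipschitz &= all(
        abs(f(x) - f(y)) <= rho(x, y) + 1e-12 for x in grid for y in grid
    )
```

Any `rho` that is a metric on the sample domain can be passed in, which reflects the point that the construction depends on the data only through pairwise distances.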

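Theorem 6 (Estimator of Dudley metric): For all $\alpha \in
[0, 1]$, the following function solves (11) for $\mathcal{F} = \mathcal{F}_\beta$:

\[ g_\alpha(x) := \max\left( -\max_{i=1,\ldots,N} |a_i^*|, \ \min\left( h_\alpha(x), \ \max_{i=1,\ldots,N} |a_i^*| \right) \right), \tag{17} \]

where

\[ h_\alpha(x) := \alpha \min_{i=1,\ldots,N} \left( a_i^* + L^* \rho(x, X_i) \right) + (1 - \alpha) \max_{i=1,\ldots,N} \left( a_i^* - L^* \rho(x, X_i) \right), \tag{18} \]

\[ \beta(P_m, Q_n) = \sum_{i=1}^N \tilde{Y}_i a_i^*, \tag{19} \]

\[ L^* = \max_{X_i \ne X_j} \frac{|a_i^* - a_j^*|}{\rho(X_i, X_j)}, \tag{20} \]

and $\{a_i^*\}_{i=1}^N$ solve the following linear program,

\[ \begin{aligned}
\max_{a_1, \ldots, a_N, b, c} \quad & \sum_{i=1}^N \tilde{Y}_i a_i \\
\text{s.t.} \quad & -b \, \rho(X_i, X_j) \le a_i - a_j \le b \, \rho(X_i, X_j), \ \forall i, j \\
& -c \le a_i \le c, \ \forall i \\
& b + c \le 1.
\end{aligned} \tag{21} \]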
Proof: The proof is similar to that of Theorem 5. Note
that

\[ \begin{aligned}
1 \ge \|f\|_L + \|f\|_\infty &= \sup_{x \ne y} \frac{|f(x) - f(y)|}{\rho(x, y)} + \sup_{x \in M} |f(x)| \\
&\ge \max_{X_i \ne X_j} \frac{|f(X_i) - f(X_j)|}{\rho(X_i, X_j)} + \max_i |f(X_i)|,
\end{aligned} \]

which means

\[ \beta(P_m, Q_n) \le \sup\left\{ \sum_{i=1}^N \tilde{Y}_i f(X_i) : \max_i |f(X_i)| + \max_{X_i \ne X_j} \frac{|f(X_i) - f(X_j)|}{\rho(X_i, X_j)} \le 1 \right\}. \tag{22} \]

Let $a_i := f(X_i)$. Therefore, $\beta(P_m, Q_n) \le \sum_{i=1}^N \tilde{Y}_i a_i^*$, where
$\{a_i^*\}_{i=1}^N$ solve

\[ \max_{a_1, \ldots, a_N} \sum_{i=1}^N \tilde{Y}_i a_i \quad \text{s.t.} \quad \max_{X_i \ne X_j} \frac{|a_i - a_j|}{\rho(X_i, X_j)} + \max_i |a_i| \le 1. \tag{23} \]

Introducing variables $b$ and $c$ such that $\max_{X_i \ne X_j} \frac{|a_i - a_j|}{\rho(X_i, X_j)} \le
b$ and $\max_i |a_i| \le c$ reduces the program in (23) to (21). In
addition, it is easy to see that the optimum occurs at the
boundary of the constraint set and therefore $\max_{X_i \ne X_j} \frac{|a_i^* - a_j^*|}{\rho(X_i, X_j)} +
\max_i |a_i^*| = 1$. Hence, by Lemma 20 (see Appendix E),
$g_\alpha$ in (17) extends $f$ defined on $\{X_1, \ldots, X_N\}$ to $M$, i.e.,
$g_\alpha(X_i) = f(X_i)$ and $\|g_\alpha\|_{BL} = \|f\|_{BL} = 1$. Note that $h_\alpha$
in (18) is the Lipschitz extension of $g$ to $M$ (by Lemma 19).
Therefore, $g_\alpha$ is a solution to (11) and (19) holds. ■



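Theorem 7 (Estimator of MMD [26]): For $\mathcal{F} = \mathcal{F}_k$, the following function is the unique solution to (11):

$$f = \frac{1}{\left\|\sum_{i=1}^N \widetilde{Y}_i k(\cdot,X_i)\right\|_{\mathscr{H}}}\sum_{i=1}^N \widetilde{Y}_i k(\cdot,X_i), \tag{24}$$

and

$$\gamma_k(\mathbb{P}_m,\mathbb{Q}_n) = \left\|\sum_{i=1}^N \widetilde{Y}_i k(\cdot,X_i)\right\|_{\mathscr{H}} = \sqrt{\sum_{i,j=1}^N \widetilde{Y}_i \widetilde{Y}_j k(X_i,X_j)}. \tag{25}$$

Proof: Consider $\gamma_k(\mathbb{P}_m,\mathbb{Q}_n) := \sup\{\sum_{i=1}^N \widetilde{Y}_i f(X_i) : \|f\|_{\mathscr{H}} \le 1\}$, which can be written as

$$\gamma_k(\mathbb{P}_m,\mathbb{Q}_n) = \sup_{\|f\|_{\mathscr{H}} \le 1}\left\langle f, \sum_{i=1}^N \widetilde{Y}_i k(\cdot,X_i)\right\rangle_{\mathscr{H}}, \tag{26}$$

where we have used the reproducing property of $\mathscr{H}$, i.e., $\forall f \in \mathscr{H}$, $\forall x \in M$, $f(x) = \langle f, k(\cdot,x)\rangle_{\mathscr{H}}$. The result follows from using the Cauchy-Schwarz inequality. ■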
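Since the MMD estimator is the RKHS norm of a signed combination of kernel sections, it reduces to a quadratic form in the Gram matrix. The following is a minimal sketch of that computation, assuming weights $1/m$ for the sample from $\mathbb{P}$ and $-1/n$ for the sample from $\mathbb{Q}$; the helper names `gaussian_kernel` and `mmd_estimate` and the toy samples are illustrative assumptions.

```python
import math


def gaussian_kernel(x, y, tau=1.0):
    """k(x, y) = exp(-||x - y||_2^2 / (2 tau^2)) for points given as tuples."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-sq_dist / (2.0 * tau ** 2))


def mmd_estimate(xs, zs, kernel):
    """Square root of the quadratic form sum_{i,j} w_i w_j k(X_i, X_j)."""
    m, n = len(xs), len(zs)
    pts = list(xs) + list(zs)
    # Signed weights: +1/m for the first sample, -1/n for the second.
    w = [1.0 / m] * m + [-1.0 / n] * n
    total = sum(
        w[i] * w[j] * kernel(pts[i], pts[j])
        for i in range(len(pts))
        for j in range(len(pts))
    )
    return math.sqrt(max(total, 0.0))  # guard against tiny negative round-off


sample_p = [(0.0,), (1.0,)]
near = mmd_estimate(sample_p, [(0.1,), (1.1,)], gaussian_kernel)
far = mmd_estimate(sample_p, [(5.0,), (6.0,)], gaussian_kernel)
same = mmd_estimate(sample_p, sample_p, gaussian_kernel)
```

The cost is quadratic in $N = m + n$ kernel evaluations and does not otherwise depend on the dimension of the data.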

One important observation is that the estimators in Theorems 5-7 depend on $\{X_i\}_{i=1}^N$ only through $\rho$ or $k$. This means that once $\{\rho(X_i,X_j)\}_{i,j=1}^N$ or $\{k(X_i,X_j)\}_{i,j=1}^N$ is known, the complexity of the corresponding estimators is independent of the dimension $d$ when $M = \mathbb{R}^d$, unlike in the estimation of the KL-divergence. In addition, because these estimators depend on $M$ only through $\rho$ or $k$, the domain $M$ is immaterial as long as $\rho$ or $k$ is defined on $M$. Therefore, these estimators extend to arbitrary domains, unlike the KL-divergence, where the domain is usually chosen to be $\mathbb{R}^d$.

B. Consistency and rate of convergence 

In Section III-A, we presented the empirical estimators of $W$, $\beta$ and $\gamma_k$. For these estimators to be reliable, we need them to converge to the population values as $m, n \to \infty$. Even if this holds, we would like to have a fast rate of convergence such that in practice, fewer samples are sufficient to obtain reliable estimates. We address these issues in this section.

Before we start presenting the results, we briefly introduce some terminology and notation from empirical process theory. For any $r \ge 1$ and probability measure $\mathbb{Q}$, define the $L^r$ norm $\|f\|_{\mathbb{Q},r} := \left(\int |f|^r\, d\mathbb{Q}\right)^{1/r}$ and let $L^r(\mathbb{Q})$ denote the metric space induced by this norm. The covering number $\mathcal{N}(\epsilon,\mathcal{F},L^r(\mathbb{Q}))$ is the minimal number of $L^r(\mathbb{Q})$ balls of radius $\epsilon$ needed to cover $\mathcal{F}$. $\mathcal{H}(\epsilon,\mathcal{F},L^r(\mathbb{Q})) := \log \mathcal{N}(\epsilon,\mathcal{F},L^r(\mathbb{Q}))$ is called the entropy of $\mathcal{F}$ using the $L^r(\mathbb{Q})$ metric. Define the minimal envelope function: $F(x) := \sup_{f \in \mathcal{F}}|f(x)|$.

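We now present a general result on the strong consistency of $\gamma_{\mathcal{F}}(\mathbb{P}_m,\mathbb{Q}_n)$, which simply follows from Theorem 21 (see Appendix E).

Theorem 8: Suppose the following conditions hold:
(i) $\int F\, d\mathbb{P} < \infty$.
(ii) $\int F\, d\mathbb{Q} < \infty$.
(iii) $\forall \epsilon > 0$, $\frac{1}{m}\mathcal{H}(\epsilon,\mathcal{F},L^1(\mathbb{P}_m)) \xrightarrow{\mathbb{P}} 0$ as $m \to \infty$.
(iv) $\forall \epsilon > 0$, $\frac{1}{n}\mathcal{H}(\epsilon,\mathcal{F},L^1(\mathbb{Q}_n)) \xrightarrow{\mathbb{Q}} 0$ as $n \to \infty$.
Then, $|\gamma_{\mathcal{F}}(\mathbb{P}_m,\mathbb{Q}_n) - \gamma_{\mathcal{F}}(\mathbb{P},\mathbb{Q})| \xrightarrow{a.s.} 0$ as $m, n \to \infty$.

Proof: See Appendix B. ■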
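The following corollary to Theorem 8 shows that $W(\mathbb{P}_m,\mathbb{Q}_n)$ and $\beta(\mathbb{P}_m,\mathbb{Q}_n)$ are strongly consistent.

Corollary 9 (Consistency of $W$ and $\beta$): Let $(M,\rho)$ be a totally bounded metric space. Then, as $m, n \to \infty$,
(i) $|W(\mathbb{P}_m,\mathbb{Q}_n) - W(\mathbb{P},\mathbb{Q})| \xrightarrow{a.s.} 0$.
(ii) $|\beta(\mathbb{P}_m,\mathbb{Q}_n) - \beta(\mathbb{P},\mathbb{Q})| \xrightarrow{a.s.} 0$.

Proof: For any $f \in \mathcal{F}_W$,

$$f(x) \le \sup_{x \in M}|f(x)| \le \sup_{x,y}|f(x)-f(y)| \le \|f\|_L \sup_{x,y}\rho(x,y) \le \|f\|_L\,\text{diam}(M) \le \text{diam}(M) < \infty,$$

where $\text{diam}(M)$ represents the diameter of $M$. Therefore, $\forall x \in M$, $F(x) \le \text{diam}(M) < \infty$, which satisfies (i) and (ii) in Theorem 8. Kolmogorov and Tihomirov [49] have shown that

$$\mathcal{H}(\epsilon,\mathcal{F}_W,\|\cdot\|_\infty) \le \mathcal{N}\left(\frac{\epsilon}{4},M,\rho\right)\log\left(2\left\lceil\frac{2\,\text{diam}(M)}{\epsilon}\right\rceil + 1\right). \tag{27}$$

Since $\mathcal{H}(\epsilon,\mathcal{F}_W,L^1(\mathbb{P}_m)) \le \mathcal{H}(\epsilon,\mathcal{F}_W,\|\cdot\|_\infty)$, the conditions (iii) and (iv) in Theorem 8 are satisfied and therefore, $|W(\mathbb{P}_m,\mathbb{Q}_n) - W(\mathbb{P},\mathbb{Q})| \xrightarrow{a.s.} 0$ as $m, n \to \infty$. Since $\mathcal{F}_\beta \subset \mathcal{F}_W$, the envelope function associated with $\mathcal{F}_\beta$ is upper bounded by the envelope function associated with $\mathcal{F}_W$ and $\mathcal{H}(\epsilon,\mathcal{F}_\beta,\|\cdot\|_\infty) \le \mathcal{H}(\epsilon,\mathcal{F}_W,\|\cdot\|_\infty)$. Therefore, the result for $\beta$ follows. ■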

Similar to Corollary 9, a strong consistency result for $\gamma_k$ can be provided by estimating the entropy number of $\mathcal{F}_k$. See Cucker and Zhou [50, Chapter 5] for the estimates of entropy numbers for various $\mathscr{H}$. However, in the following, we adopt a different approach to prove the strong consistency of $\gamma_k$. To this end, we first provide a general result on the rate of convergence of $\gamma_{\mathcal{F}}(\mathbb{P}_m,\mathbb{Q}_n)$ and then, as a special case, obtain the rates of convergence of the estimators of $W$, $\beta$ and $\gamma_k$. Using this result, we then prove the strong consistency of $\gamma_k$. We start with the following definition.

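Definition 10 (Rademacher complexity): Let $\mathcal{F}$ be a class of functions on $M$ and $\{\sigma_i\}_{i=1}^m$ be independent Rademacher random variables, i.e., $\Pr(\sigma_i = +1) = \Pr(\sigma_i = -1) = \frac{1}{2}$. The Rademacher process is defined as $\left\{\frac{1}{m}\sum_{i=1}^m \sigma_i f(x_i) : f \in \mathcal{F}\right\}$ for some $\{x_i\}_{i=1}^m \subset M$. The Rademacher complexity over $\mathcal{F}$ is defined as

$$R_m(\mathcal{F};\{x_i\}_{i=1}^m) := \mathbb{E}\sup_{f \in \mathcal{F}}\left|\frac{1}{m}\sum_{i=1}^m \sigma_i f(x_i)\right|. \tag{28}$$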



We now present a general result that provides a probabilistic bound on the deviation of $\gamma_{\mathcal{F}}(\mathbb{P}_m,\mathbb{Q}_n)$ from $\gamma_{\mathcal{F}}(\mathbb{P},\mathbb{Q})$. This generalizes [26, Theorem 4], the main difference being that we now consider function classes other than RKHSs, and thus express the bound in terms of the Rademacher complexities (see the proof for further discussion).

Theorem 11: For any $\mathcal{F}$ such that $\nu := \sup_{x \in M} F(x) < \infty$, with probability at least $1 - \delta$, the following holds:

$$|\gamma_{\mathcal{F}}(\mathbb{P}_m,\mathbb{Q}_n) - \gamma_{\mathcal{F}}(\mathbb{P},\mathbb{Q})| \le \sqrt{18\nu^2\log\frac{4}{\delta}}\left(\frac{1}{\sqrt{m}} + \frac{1}{\sqrt{n}}\right) + 2R_m(\mathcal{F};\{X_i^{(1)}\}_{i=1}^m) + 2R_n(\mathcal{F};\{X_i^{(2)}\}_{i=1}^n). \tag{29}$$

Proof: See Appendix C. ■
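To get a feel for the size of the bound in Theorem 11, it can be evaluated directly once the envelope bound $\nu$ and the two Rademacher complexities are supplied. The sketch below is only illustrative: the function name `ipm_deviation_bound` is hypothetical, and the choice $R_m = 1/\sqrt{m}$ with $\nu = 1$ in the usage lines is an assumption standing in for a class whose complexity decays at the parametric rate.

```python
import math


def ipm_deviation_bound(m, n, nu, delta, rad_m, rad_n):
    """Right hand side of the deviation bound:
    sqrt(18 nu^2 log(4/delta)) (1/sqrt(m) + 1/sqrt(n)) + 2 R_m + 2 R_n.
    """
    concentration = math.sqrt(18.0 * nu ** 2 * math.log(4.0 / delta))
    concentration *= 1.0 / math.sqrt(m) + 1.0 / math.sqrt(n)
    return concentration + 2.0 * rad_m + 2.0 * rad_n


def example_bound(m, delta=0.05):
    # Assumed setting: m = n, nu = 1, Rademacher complexity 1/sqrt(m).
    rad = 1.0 / math.sqrt(m)
    return ipm_deviation_bound(m, m, 1.0, delta, rad, rad)


bound_250 = example_bound(250)
bound_1000 = example_bound(1000)
```

Under these assumptions every term scales as $m^{-1/2}$, so quadrupling the sample size halves the bound.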

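Theorem 11 holds for any $\mathcal{F}$ for which $\nu$ is finite. However, to obtain the rate of convergence for $\gamma_{\mathcal{F}}(\mathbb{P}_m,\mathbb{Q}_n)$, one requires an estimate of $R_m(\mathcal{F};\{X_i^{(1)}\}_{i=1}^m)$ and $R_n(\mathcal{F};\{X_i^{(2)}\}_{i=1}^n)$. Note that if $R_m(\mathcal{F};\{X_i^{(1)}\}_{i=1}^m) \xrightarrow{\mathbb{P}} 0$ as $m \to \infty$ and $R_n(\mathcal{F};\{X_i^{(2)}\}_{i=1}^n) \xrightarrow{\mathbb{Q}} 0$ as $n \to \infty$, then

$$|\gamma_{\mathcal{F}}(\mathbb{P}_m,\mathbb{Q}_n) - \gamma_{\mathcal{F}}(\mathbb{P},\mathbb{Q})| \xrightarrow{\mathbb{P},\mathbb{Q}} 0 \quad \text{as } m, n \to \infty.$$

Also note that if $R_m(\mathcal{F};\{X_i^{(1)}\}_{i=1}^m) = O_{\mathbb{P}}(r_m)$ and $R_n(\mathcal{F};\{X_i^{(2)}\}_{i=1}^n) = O_{\mathbb{Q}}(r_n)$, then from (29),

$$|\gamma_{\mathcal{F}}(\mathbb{P}_m,\mathbb{Q}_n) - \gamma_{\mathcal{F}}(\mathbb{P},\mathbb{Q})| = O_{\mathbb{P},\mathbb{Q}}\left(r_m \vee m^{-1/2} + r_n \vee n^{-1/2}\right),$$

where $a \vee b := \max(a,b)$. The following corollary to Theorem 11 provides the rate of convergence for $W$, $\beta$ and $\gamma_k$. Note that Corollary 12(ii) was proved in [26], [51, Appendix A.2] by a more direct argument, where the fact that $\mathcal{F}_k$ is an RKHS was used at an earlier stage of the proof to simplify the reasoning. We include the result here for completeness.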



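Corollary 12 (Rates of convergence for $W$, $\beta$ and $\gamma_k$):
(i) Let $M$ be a bounded subset of $(\mathbb{R}^d, \|\cdot\|_s)$ for some $1 \le s \le \infty$. Then,

$$|W(\mathbb{P}_m,\mathbb{Q}_n) - W(\mathbb{P},\mathbb{Q})| = O_{\mathbb{P},\mathbb{Q}}(r_m + r_n)$$

and

$$|\beta(\mathbb{P}_m,\mathbb{Q}_n) - \beta(\mathbb{P},\mathbb{Q})| = O_{\mathbb{P},\mathbb{Q}}(r_m + r_n),$$

where

$$r_m = \begin{cases} m^{-1/2}\log m, & d = 1 \\ m^{-1/(d+1)}, & d \ge 2 \end{cases}. \tag{30}$$

In addition, if $M$ is a bounded, convex subset of $(\mathbb{R}^d, \|\cdot\|_s)$ with non-empty interior, then

$$r_m = \begin{cases} m^{-1/2}, & d = 1 \\ m^{-1/2}\log m, & d = 2 \\ m^{-1/d}, & d > 2 \end{cases}. \tag{31}$$

(ii) Let $M$ be a measurable space. Suppose $k$ is measurable and $\sup_{x \in M} k(x,x) \le C < \infty$. Then,

$$|\gamma_k(\mathbb{P}_m,\mathbb{Q}_n) - \gamma_k(\mathbb{P},\mathbb{Q})| = O_{\mathbb{P},\mathbb{Q}}(m^{-1/2} + n^{-1/2}).$$

In addition,

$$|\gamma_k(\mathbb{P}_m,\mathbb{Q}_n) - \gamma_k(\mathbb{P},\mathbb{Q})| \xrightarrow{a.s.} 0 \quad \text{as } m, n \to \infty,$$

i.e., the estimator of MMD is strongly consistent.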

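Proof: (i) Define $R_m(\mathcal{F}) := R_m(\mathcal{F};\{X_i^{(1)}\}_{i=1}^m)$. The generalized entropy bound [41, Theorem 16] gives that for every $\epsilon > 0$,

$$R_m(\mathcal{F}) \le 2\epsilon + \frac{4\sqrt{2}}{\sqrt{m}}\int_{\epsilon/4}^{\infty}\sqrt{\mathcal{H}(\tau,\mathcal{F},L^2(\mathbb{P}_m))}\, d\tau. \tag{32}$$

Let $\mathcal{F} = \mathcal{F}_W$. Since $M$ is a bounded subset of $\mathbb{R}^d$, it is totally bounded and therefore the entropy number in (32) can be bounded through (27) by noting that

$$\mathcal{H}(\tau,\mathcal{F}_W,L^2(\mathbb{P}_m)) \le \mathcal{H}(\tau,\mathcal{F}_W,\|\cdot\|_\infty) \le \frac{C_1}{\tau^{d+1}} + \frac{C_2}{\tau^{d}}, \tag{33}$$

where we have used the fact that $\mathcal{N}(\epsilon,M,\|\cdot\|_s) = O(\epsilon^{-d})$, $1 \le s \le \infty$ (see the footnote) and $\log(\lceil x\rceil + 1) \le x + 1$. The constants $C_1$ and $C_2$ depend only on the properties of $M$ and are independent of $\tau$. Substituting (33) in (32), we have

$$R_m(\mathcal{F}_W) \le \inf_{\epsilon > 0}\left[2\epsilon + \frac{4\sqrt{2}}{\sqrt{m}}\int_{\epsilon/4}^{4R}\sqrt{\mathcal{H}(\tau,\mathcal{F}_W,L^2(\mathbb{P}_m))}\, d\tau\right] \le \inf_{\epsilon > 0}\left[2\epsilon + \frac{4\sqrt{2}}{\sqrt{m}}\int_{\epsilon/4}^{4R}\left(\frac{\sqrt{C_1}}{\tau^{(d+1)/2}} + \frac{\sqrt{C_2}}{\tau^{d/2}}\right) d\tau\right],$$

where $R := \text{diam}(M)$. Note the change in the upper limit of the integral from $\infty$ to $4R$. This is because $M$ is totally bounded and $\mathcal{H}(\tau,\mathcal{F}_W,\|\cdot\|_\infty)$ depends on $\mathcal{N}(\tau/4,M,\|\cdot\|_s)$. The rates in (30) are simply obtained by solving the right hand side of the above inequality. As mentioned in the paragraph preceding the statement of Corollary 12, we have $r_m \vee m^{-1/2} = r_m$ and $r_n \vee n^{-1/2} = r_n$, and so the result for $W(\mathbb{P}_m,\mathbb{Q}_n)$ follows.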

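Suppose $M$ is convex. Then $M$ is connected. It is easy to see that $M$ is also centered, i.e., for all subsets $A \subset M$ with $\text{diam}(A) \le 2r$ there exists a point $x \in M$ such that $\|x - a\|_s \le r$ for all $a \in A$. Since $M$ is connected and centered, we have from [49] that

$$\mathcal{H}(\tau,\mathcal{F}_W,\|\cdot\|_\infty) \le \mathcal{N}\left(\frac{\tau}{2},M,\|\cdot\|_s\right)\log 2 + \log\left(2\left\lceil\frac{2\,\text{diam}(M)}{\tau}\right\rceil + 1\right) \le C_3\tau^{-d} + C_4\tau^{-1} + C_5, \tag{34}$$

where we used the fact that $\mathcal{N}(\epsilon,M,\|\cdot\|_s) = O(\epsilon^{-d})$. $C_3$, $C_4$ and $C_5$ are constants that depend only on the properties of $M$ and are independent of $\tau$. Substituting (34) in (32), we have

$$R_m(\mathcal{F}_W) \le \inf_{\epsilon > 0}\left[2\epsilon + \frac{4\sqrt{2}}{\sqrt{m}}\int_{\epsilon/4}^{2R}\sqrt{C_3\tau^{-d} + C_4\tau^{-1} + C_5}\; d\tau\right].$$

Again note the change in the upper limit of the integral from $\infty$ to $2R$. This is because $\mathcal{H}(\tau,\mathcal{F}_W,\|\cdot\|_\infty)$ depends on $\mathcal{N}(\tau/2,M,\|\cdot\|_s)$. The rates in (31) are obtained by solving the right hand side of the above inequality. Since $r_m \vee m^{-1/2} = r_m$, the result for $W(\mathbb{P}_m,\mathbb{Q}_n)$ follows.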

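Since $\mathcal{F}_\beta \subset \mathcal{F}_W$, we have $R_m(\mathcal{F}_\beta) \le R_m(\mathcal{F}_W)$ and therefore, the result for $\beta(\mathbb{P}_m,\mathbb{Q}_n)$ follows. The rates in (31) can also be directly obtained for $\beta$ by using the entropy number of $\mathcal{F}_\beta$, i.e., $\mathcal{H}(\epsilon,\mathcal{F}_\beta,\|\cdot\|_\infty) = O(\epsilon^{-d})$ [20, Theorem 2.7.1] in (32).

(ii) By [53, Lemma 22], $R_m(\mathcal{F}_k;\{X_i^{(1)}\}_{i=1}^m) \le \frac{\sqrt{C}}{\sqrt{m}}$ and $R_n(\mathcal{F}_k;\{X_i^{(2)}\}_{i=1}^n) \le \frac{\sqrt{C}}{\sqrt{n}}$. Substituting these in (29) yields the result. In addition, by the Borel-Cantelli lemma, the strong consistency of $\gamma_k(\mathbb{P}_m,\mathbb{Q}_n)$ follows. ■

Footnote: Note that for any $x \in M \subset \mathbb{R}^d$, $\|x\|_\infty \le \cdots \le \|x\|_s \le \cdots \le \|x\|_2 \le \cdots \le \|x\|_1 \le \sqrt{d}\|x\|_2$. Therefore, $\forall s \ge 2$, $\mathcal{N}(\epsilon,M,\|\cdot\|_s) \le \mathcal{N}(\epsilon,M,\|\cdot\|_2)$ and $\forall\, 1 \le s \le 2$, $\mathcal{N}(\epsilon,M,\|\cdot\|_s) \le \mathcal{N}(\epsilon,M,\sqrt{d}\|\cdot\|_2) = \mathcal{N}(\epsilon/\sqrt{d},M,\|\cdot\|_2)$. Use $\mathcal{N}(\epsilon,M,\|\cdot\|_2) = O(\epsilon^{-d})$ [52, Lemma 2.5].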
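The $\sqrt{C}/\sqrt{m}$ bound on the Rademacher complexity of the unit ball of an RKHS can be checked numerically on a small sample. The sketch below relies on the standard identity $\sup_{\|f\|_{\mathscr{H}} \le 1}|\frac{1}{m}\sum_i \sigma_i f(x_i)| = \frac{1}{m}\sqrt{\sigma^\top K \sigma}$, where $K$ is the Gram matrix, and averages it exactly over all $2^m$ sign vectors. The function names, the Gaussian kernel with $\tau = 1$ (so $C = 1$) and the eight sample points are illustrative assumptions.

```python
import itertools
import math


def gaussian_kernel(x, y, tau=1.0):
    return math.exp(-((x - y) ** 2) / (2.0 * tau ** 2))


def rkhs_rademacher_exact(points, kernel):
    """Exact Rademacher complexity of the RKHS unit ball on a fixed sample.

    For each sign vector the supremum equals sqrt(sigma^T K sigma) / m;
    the expectation is computed by enumerating all 2^m sign vectors.
    """
    m = len(points)
    gram = [[kernel(a, b) for b in points] for a in points]
    total = 0.0
    for signs in itertools.product((-1.0, 1.0), repeat=m):
        quad = sum(
            signs[i] * signs[j] * gram[i][j]
            for i in range(m)
            for j in range(m)
        )
        total += math.sqrt(max(quad, 0.0)) / m
    return total / 2 ** m


points = [0.2 * i for i in range(8)]
rad = rkhs_rademacher_exact(points, gaussian_kernel)
bound = math.sqrt(1.0 / len(points))  # sqrt(C / m) with C = sup_x k(x, x) = 1
```

By Jensen's inequality the exact value can never exceed the bound; the enumeration is exponential in $m$ and is meant only as a sanity check on tiny samples.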

Remark 13: (i) Note that the rate of convergence of $W$ and $\beta$ is dependent on the dimension, $d$, which means that in large dimensions, more samples are needed to obtain useful estimates of $W$ and $\beta$. Also note that the rates are independent of the metric, $\|\cdot\|_s$, $1 \le s \le \infty$.

(ii) Note that when $M$ is a bounded, convex subset of $(\mathbb{R}^d,\|\cdot\|_s)$, faster rates are obtained than for the case where $M$ is just a bounded (but not convex) subset of $(\mathbb{R}^d,\|\cdot\|_s)$.

(iii) In the case of MMD, we have not made any assumptions on $M$ except it being a measurable space. This means that in the case of $\mathbb{R}^d$, the rate is independent of $d$, which is a very useful property. The condition of the kernel being bounded is satisfied by a host of kernels, examples of which include the Gaussian kernel, $k(x,y) = \exp(-\sigma\|x-y\|_2^2)$, $\sigma > 0$, the Laplacian kernel, $k(x,y) = \exp(-\sigma\|x-y\|_1)$, $\sigma > 0$, inverse multiquadrics, $k(x,y) = (c^2 + \|x-y\|_2^2)^{-t}$, $c > 0$, $t > d/2$, etc. on $\mathbb{R}^d$. See Wendland [54] for more examples. As mentioned before, the estimates for $R_m(\mathcal{F}_k;\{X_i^{(1)}\}_{i=1}^m)$ can be directly obtained by using the entropy numbers of $\mathcal{F}_k$. See Cucker and Zhou [50, Chapter 5] for the estimates of entropy numbers for various $\mathscr{H}$.

The results derived so far in this section show that the estimators of the Wasserstein distance, Dudley metric and MMD exhibit good convergence behavior, irrespective of the distributions, unlike the case with $\phi$-divergences.

C. Simulation results 

So far, in Sections III-A and III-B, we have presented the empirical estimation of $W$, $\beta$ and $\gamma_k$ and their convergence analysis. Now, the question is how good these estimators are in practice. In this section, we demonstrate the performance of these estimators through simulations.

As we have mentioned before, given $\mathbb{P}$ and $\mathbb{Q}$, it is usually difficult to exactly compute $W$, $\beta$ and $\gamma_k$. However, in order to test the performance of their estimators, in the following, we consider some examples where $W$, $\beta$ and $\gamma_k$ can be computed exactly.

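1) Estimator of $W$: For ease of computation, let us consider $\mathbb{P}$ and $\mathbb{Q}$ (defined on the Borel $\sigma$-algebra of $\mathbb{R}^d$) as product measures, $\mathbb{P} = \otimes_{i=1}^d \mathbb{P}^{(i)}$ and $\mathbb{Q} = \otimes_{i=1}^d \mathbb{Q}^{(i)}$, where $\mathbb{P}^{(i)}$ and $\mathbb{Q}^{(i)}$ are defined on the Borel $\sigma$-algebra of $\mathbb{R}$. In this setting, when $\rho(x,y) = \|x-y\|_1$, it is easy to show that

$$W(\mathbb{P},\mathbb{Q}) = \sum_{i=1}^d W(\mathbb{P}^{(i)},\mathbb{Q}^{(i)}), \tag{35}$$

where

$$W(\mathbb{P}^{(i)},\mathbb{Q}^{(i)}) = \int_{\mathbb{R}}\left|F_{\mathbb{P}^{(i)}}(x) - F_{\mathbb{Q}^{(i)}}(x)\right| dx, \tag{36}$$

and $F_{\mathbb{P}^{(i)}}(x) = \mathbb{P}^{(i)}((-\infty,x])$ [45] (see footnote 6). Now, in the following, we consider two examples where $W$ in (36) can be computed in closed form. Note that we need $M$ to be a bounded subset of $\mathbb{R}^d$ such that the consistency of $W(\mathbb{P}_m,\mathbb{Q}_n)$ is guaranteed by Corollary 12.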

Example 1: Let $M = \times_{i=1}^d [a_i,s_i]$. Suppose $\mathbb{P}^{(i)} = U[a_i,b_i]$ and $\mathbb{Q}^{(i)} = U[r_i,s_i]$, which are uniform distributions on $[a_i,b_i]$ and $[r_i,s_i]$ respectively, where $-\infty < a_i \le r_i \le b_i \le s_i < \infty$. Then, it is easy to verify that $W(\mathbb{P}^{(i)},\mathbb{Q}^{(i)}) = (s_i + r_i - a_i - b_i)/2$ and $W(\mathbb{P},\mathbb{Q})$ follows from (35).

Figures 1(a) and 1(b) show the empirical estimates of $W$ (shown in thick dotted lines) for $d = 1$ and $d = 5$ respectively. Figure 1(c) shows the behavior of $W(\mathbb{P}_m,\mathbb{Q}_n)$ and $W(\mathbb{P},\mathbb{Q})$ for various $d$ with a fixed sample size of $m = n = 250$. Here, we chose $a_i = -\frac{1}{2}$, $b_i = \frac{1}{2}$, $r_i = 0$ and $s_i = 1$ for all $i = 1,\ldots,d$ such that $W(\mathbb{P}^{(i)},\mathbb{Q}^{(i)}) = \frac{1}{2}$, $\forall i$ and $W(\mathbb{P},\mathbb{Q}) = \frac{d}{2}$, shown in thin dotted lines in Figures 1(a-c). Note that the present choice of $\mathbb{P}$ and $\mathbb{Q}$ would result in a KL-divergence of $+\infty$. ■
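The closed form in Example 1 can be checked against a direct numerical evaluation of the one-dimensional formula $W = \int |F_{\mathbb{P}} - F_{\mathbb{Q}}|\,dx$. The following sketch uses a plain midpoint rule; the helper names and the number of integration steps are illustrative assumptions, while the parameter values are those used in the example.

```python
def uniform_cdf(x, lo, hi):
    """CDF of the uniform distribution on [lo, hi]."""
    return min(1.0, max(0.0, (x - lo) / (hi - lo)))


def w1_from_cdfs(cdf_p, cdf_q, lo, hi, steps=20000):
    """Midpoint-rule approximation of the integral of |F_P - F_Q| over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(
        abs(cdf_p(lo + (k + 0.5) * h) - cdf_q(lo + (k + 0.5) * h))
        for k in range(steps)
    )


a, b, r, s = -0.5, 0.5, 0.0, 1.0  # P = U[a, b], Q = U[r, s]
numeric = w1_from_cdfs(
    lambda x: uniform_cdf(x, a, b),
    lambda x: uniform_cdf(x, r, s),
    a, s,
)
closed = (s + r - a - b) / 2.0  # per-coordinate closed form
w_product_d5 = 5 * closed       # product measure in d = 5: sum over coordinates
```

Outside $[a, s]$ the two distribution functions coincide, so restricting the integral to that interval loses nothing.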



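Example 2: Let $M = \times_{i=1}^d [0,c_i]$. Suppose $\mathbb{P}^{(i)}$, $\mathbb{Q}^{(i)}$ have densities

$$p_i(x) = \frac{d\mathbb{P}^{(i)}}{dx} = \frac{\lambda_i e^{-\lambda_i x}}{1 - e^{-\lambda_i c_i}}, \qquad q_i(x) = \frac{d\mathbb{Q}^{(i)}}{dx} = \frac{\mu_i e^{-\mu_i x}}{1 - e^{-\mu_i c_i}}$$

respectively, where $\lambda_i > 0$, $\mu_i > 0$. Note that $\mathbb{P}^{(i)}$ and $\mathbb{Q}^{(i)}$ are exponential distributions supported on $[0,c_i]$ with rate parameters $\lambda_i$ and $\mu_i$. Then, it can be shown that

$$W(\mathbb{P}^{(i)},\mathbb{Q}^{(i)}) = \left|\frac{1}{\lambda_i} - \frac{1}{\mu_i} - \frac{c_i\left(e^{-\lambda_i c_i} - e^{-\mu_i c_i}\right)}{\left(1 - e^{-\lambda_i c_i}\right)\left(1 - e^{-\mu_i c_i}\right)}\right|,$$

and $W(\mathbb{P},\mathbb{Q})$ follows from (35).

Figures 1(a') and 1(b') show the empirical estimates of $W$ (shown in thick dotted lines) for $d = 1$ and $d = 5$ respectively. Let $\lambda = (\lambda_1,\ldots,\lambda_d)$, $\mu = (\mu_1,\ldots,\mu_d)$ and $c = (c_1,\ldots,c_d)$. In Figure 1(a'), we chose $\lambda = (3)$, $\mu = (1)$ and $c = (5)$, which gives $W(\mathbb{P},\mathbb{Q}) = 0.6327$. In Figure 1(b'), we chose $\lambda = (3,2,1/2,2,7)$, $\mu = (1,5,5/2,1,8)$ and $c = (5,6,3,2,10)$, which gives $W(\mathbb{P},\mathbb{Q}) = 1.9149$. The population values $W(\mathbb{P},\mathbb{Q})$ are shown in thin dotted lines in Figures 1(a') and 1(b'). Figure 1(c') shows $W(\mathbb{P}_m,\mathbb{Q}_n)$ and $W(\mathbb{P},\mathbb{Q})$ for various $d$ with a fixed sample size of $m = n = 250$, $\lambda = (3,3,\ldots,3)$, $\mu = (1,1,\ldots,1)$ and $c = (5,5,\ldots,5)$. ■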
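The population values quoted in Example 2 can be reproduced from the truncated-exponential setup. The sketch below evaluates the per-coordinate expression $|1/\lambda - 1/\mu - c(e^{-\lambda c} - e^{-\mu c})/((1 - e^{-\lambda c})(1 - e^{-\mu c}))|$ and, as a cross-check, integrates $|F_{\mathbb{P}} - F_{\mathbb{Q}}|$ numerically; the function names and the midpoint rule are illustrative assumptions.

```python
import math


def trunc_exp_cdf(x, rate, c):
    """CDF of an exponential distribution with the given rate, truncated to [0, c]."""
    return (1.0 - math.exp(-rate * x)) / (1.0 - math.exp(-rate * c))


def w1_trunc_exp_closed(lam, mu, c):
    """Per-coordinate Wasserstein distance between the two truncated exponentials."""
    num = c * (math.exp(-lam * c) - math.exp(-mu * c))
    den = (1.0 - math.exp(-lam * c)) * (1.0 - math.exp(-mu * c))
    return abs(1.0 / lam - 1.0 / mu - num / den)


def w1_trunc_exp_numeric(lam, mu, c, steps=20000):
    """Midpoint-rule integral of |F_P - F_Q| over [0, c]."""
    h = c / steps
    return h * sum(
        abs(trunc_exp_cdf((k + 0.5) * h, lam, c) - trunc_exp_cdf((k + 0.5) * h, mu, c))
        for k in range(steps)
    )


# d = 1 setting of the example.
w_d1 = w1_trunc_exp_closed(3.0, 1.0, 5.0)
w_d1_numeric = w1_trunc_exp_numeric(3.0, 1.0, 5.0)

# d = 5 setting: the product structure turns W into a sum over coordinates.
lams = (3.0, 2.0, 0.5, 2.0, 7.0)
mus = (1.0, 5.0, 2.5, 1.0, 8.0)
cs = (5.0, 6.0, 3.0, 2.0, 10.0)
w_d5 = sum(w1_trunc_exp_closed(l, u, c) for l, u, c in zip(lams, mus, cs))
```

Both values agree with the figures reported in the example to the four decimals given there.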

The empirical estimates in Figure 1 are obtained by drawing $N$ i.i.d. samples (with $m = n = N/2$) from $\mathbb{P}$ and $\mathbb{Q}$ and then solving the linear program in (15). It is easy to see from Figures 1(a-b, a'-b') that the estimate of $W(\mathbb{P},\mathbb{Q})$ improves with increasing sample size and that $W(\mathbb{P}_m,\mathbb{Q}_n)$ estimates $W(\mathbb{P},\mathbb{Q})$ correctly, which therefore demonstrates the efficacy of the estimator. Figures 1(c) and 1(c') show the effect of the dimensionality, $d$, of the data on the estimate of $W(\mathbb{P},\mathbb{Q})$. They show that at large $d$, the estimator has a large bias and more samples are needed to obtain better estimates. Error bars are obtained by replicating the experiment 20 times.

Fig. 1. (a-b) represent the empirical estimates of the Wasserstein distance (shown in thick dotted lines) between $\mathbb{P} = U[-\frac{1}{2},\frac{1}{2}]^d$ and $\mathbb{Q} = U[0,1]^d$ with $\rho(x,y) = \|x-y\|_1$, for increasing sample size $N$, where $d = 1$ in (a) and $d = 5$ in (b). Here $U[l_1,l_2]^d$ represents a uniform distribution on $[l_1,l_2]^d$ (see Example 1 for details). Similarly, (a'-b') represent the empirical estimates of the Wasserstein distance (shown in thick dotted lines) between $\mathbb{P}$ and $\mathbb{Q}$, which are truncated exponential distributions on $\mathbb{R}_+^d$ (see Example 2 for details), for increasing sample size $N$. Here $d = 1$ in (a') and $d = 5$ in (b') with $\rho(x,y) = \|x-y\|_1$. The population values of the Wasserstein distance between $\mathbb{P}$ and $\mathbb{Q}$ are shown in thin dotted lines in (a-c, a'-c'). (c, c') represent the behavior of $W(\mathbb{P}_m,\mathbb{Q}_n)$ and $W(\mathbb{P},\mathbb{Q})$ for varying $d$ with a fixed sample size of $m = n = 250$ (see Examples 1 and 2 for details on the choice of $\mathbb{P}$ and $\mathbb{Q}$). Error bars are obtained by replicating the experiment 20 times.

2) Estimator of $\gamma_k$: We now consider the performance of $\gamma_k(\mathbb{P}_m,\mathbb{Q}_n)$. [26], [27] have shown that when $k$ is measurable and bounded,

$$\gamma_k(\mathbb{P},\mathbb{Q}) = \left\|\int_M k(\cdot,x)\, d\mathbb{P}(x) - \int_M k(\cdot,x)\, d\mathbb{Q}(x)\right\|_{\mathscr{H}} = \left[\iint_M k(x,y)\, d\mathbb{P}(x)\, d\mathbb{P}(y) + \iint_M k(x,y)\, d\mathbb{Q}(x)\, d\mathbb{Q}(y) - 2\iint_M k(x,y)\, d\mathbb{P}(x)\, d\mathbb{Q}(y)\right]^{1/2}. \tag{37}$$

Note that, although $\gamma_k(\mathbb{P},\mathbb{Q})$ has a closed form in (37), exact computation is not always possible for all choices of $k$, $\mathbb{P}$ and $\mathbb{Q}$. In such cases, one has to resort to numerical techniques to compute the integrals in (37). In the following, we present two examples where we choose $\mathbb{P}$ and $\mathbb{Q}$ such that $\gamma_k(\mathbb{P},\mathbb{Q})$ can be computed exactly, which is then used to verify the performance of $\gamma_k(\mathbb{P}_m,\mathbb{Q}_n)$. Also note that for the consistency of $\gamma_k(\mathbb{P}_m,\mathbb{Q}_n)$, by Corollary 12(ii), we just need the kernel, $k$, to be measurable and bounded, and no assumptions on $M$ are required.

Example 3: Let $M = \mathbb{R}^d$, $\mathbb{P} = \otimes_{i=1}^d \mathbb{P}^{(i)}$ and $\mathbb{Q} = \otimes_{i=1}^d \mathbb{Q}^{(i)}$. Suppose $\mathbb{P}^{(i)} = N(\mu_i,\sigma_i^2)$ and $\mathbb{Q}^{(i)} = N(\lambda_i,\theta_i^2)$, where $N(\mu,\sigma^2)$ represents a Gaussian distribution with mean $\mu$ and variance $\sigma^2$. Let $k(x,y) = \exp(-\|x-y\|_2^2/2\tau^2)$. Clearly $k$ is measurable and bounded. With this choice of $k$, $\mathbb{P}$ and $\mathbb{Q}$, $\gamma_k$ in (37) can be computed exactly as

$$\gamma_k^2(\mathbb{P},\mathbb{Q}) = \prod_{i=1}^d \frac{\tau}{\sqrt{2\sigma_i^2 + \tau^2}} + \prod_{i=1}^d \frac{\tau}{\sqrt{2\theta_i^2 + \tau^2}} - 2\prod_{i=1}^d \frac{\tau\, e^{-\frac{(\mu_i - \lambda_i)^2}{2(\sigma_i^2 + \theta_i^2 + \tau^2)}}}{\sqrt{\sigma_i^2 + \theta_i^2 + \tau^2}}, \tag{38}$$

as the integrals in (37) simply involve the convolution of Gaussian distributions.

Figures 2(a-b) show the empirical estimates of $\gamma_k$ (shown in thick dotted lines) for $d = 1$ and $d = 5$ respectively. Figure 2(c) shows the behavior of $\gamma_k(\mathbb{P}_m,\mathbb{Q}_n)$ and $\gamma_k(\mathbb{P},\mathbb{Q})$ for varying $d$ with a fixed sample size of $m = n = 250$. Here we chose $\mu_i = 0$, $\lambda_i = 1$, $\sigma_i = \sqrt{2}$, $\theta_i = \sqrt{2}$ for all $i = 1,\ldots,d$ and $\tau = 1$. Using these values in (38), it is easy to check that $\gamma_k(\mathbb{P},\mathbb{Q}) = 5^{-d/4}(2 - 2e^{-d/10})^{1/2}$, which is shown in thin dotted lines in Figures 2(a-c). We remark that an alternative estimator of $\gamma_k$ exists which does not suffer from bias at small sample sizes: see [26]. ■

Fig. 2. (a-b) represent the empirical estimates of MMD (shown in thick dotted lines) between $\mathbb{P} = N(0,2I_d)$ and $\mathbb{Q} = N(1,2I_d)$ with $k(x,y) = \exp(-\frac{1}{2}\|x-y\|_2^2)$, for increasing sample size $N$, where $d = 1$ in (a) and $d = 5$ in (b) (see Example 3 for details). Here $N(\mu,\sigma^2 I_d)$ represents a normal distribution with mean vector $(\mu,\ldots,\mu)$ and covariance matrix $\sigma^2 I_d$. $I_d$ represents the $d \times d$ identity matrix. Similarly, (a'-b') represent the empirical estimates of MMD (shown in thick dotted lines) between $\mathbb{P}$ and $\mathbb{Q}$, which are exponential distributions on $\mathbb{R}_+^d$ (see Example 4 for details), for increasing sample size $N$. Here $d = 1$ in (a') and $d = 5$ in (b') with $k(x,y) = \exp(-\frac{1}{4}\|x-y\|_1)$. The population values of MMD are shown in thin dotted lines in (a-c, a'-c'). (c, c') represent the behavior of $\gamma_k(\mathbb{P}_m,\mathbb{Q}_n)$ and $\gamma_k(\mathbb{P},\mathbb{Q})$ for varying $d$ with a fixed sample size of $m = n = 250$ (see Examples 3 and 4 for details on the choice of $\mathbb{P}$ and $\mathbb{Q}$). Error bars are obtained by replicating the experiment 20 times.

Example 4: Let $M = \mathbb{R}_+^d$, $\mathbb{P} = \otimes_{i=1}^d \mathbb{P}^{(i)}$ and $\mathbb{Q} = \otimes_{i=1}^d \mathbb{Q}^{(i)}$. Suppose $\mathbb{P}^{(i)} = \text{Exp}(1/\lambda_i)$ and $\mathbb{Q}^{(i)} = \text{Exp}(1/\mu_i)$, which are exponential distributions on $\mathbb{R}_+$ with rate parameters $\lambda_i > 0$ and $\mu_i > 0$ respectively. Suppose $k(x,y) = \exp(-\sigma\|x-y\|_1)$, $\sigma > 0$, which is a Laplacian kernel on $\mathbb{R}^d$. Then, it is easy to verify that $\gamma_k(\mathbb{P},\mathbb{Q})$ in (37) reduces to

$$\gamma_k^2(\mathbb{P},\mathbb{Q}) = \prod_{i=1}^d \frac{\lambda_i}{\lambda_i + \sigma} + \prod_{i=1}^d \frac{\mu_i}{\mu_i + \sigma} - 2\prod_{i=1}^d \frac{\lambda_i\mu_i(\lambda_i + \mu_i + 2\sigma)}{(\lambda_i + \sigma)(\mu_i + \sigma)(\lambda_i + \mu_i)}.$$

Figures 2(a'-b') show the empirical estimates of $\gamma_k$ (shown in thick dotted lines) for $d = 1$ and $d = 5$ respectively. Figure 2(c') shows the dependence of $\gamma_k(\mathbb{P}_m,\mathbb{Q}_n)$ and $\gamma_k(\mathbb{P},\mathbb{Q})$ on $d$ at a fixed sample size of $m = n = 250$. Here, we chose $\{\lambda_i\}_{i=1}^d$ and $\{\mu_i\}_{i=1}^d$ as in Example 2 with $\sigma = \frac{1}{4}$, which gives $\gamma_k(\mathbb{P},\mathbb{Q}) = 0.2481$ for $d = 1$ and $0.3892$ for $d = 5$, shown in thin dotted lines in Figures 2(a'-c'). ■
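The population MMD values quoted in Examples 3 and 4 can be reproduced from closed forms. The sketch below is an illustration under stated assumptions: the two product formulas are the ones obtained by evaluating the three double integrals of the kernel for independent Gaussian coordinates (Gaussian kernel with bandwidth $\tau$) and independent exponential coordinates (Laplacian kernel with parameter $\sigma$); the function names are hypothetical.

```python
import math


def mmd_gaussian(mu, lam, sigma, theta, tau):
    """MMD between product Gaussians N(mu_i, sigma_i^2) and N(lam_i, theta_i^2)
    under k(x, y) = exp(-||x - y||_2^2 / (2 tau^2))."""
    t2 = tau ** 2
    pp = math.prod(tau / math.sqrt(2 * s * s + t2) for s in sigma)
    qq = math.prod(tau / math.sqrt(2 * t * t + t2) for t in theta)
    pq = math.prod(
        tau * math.exp(-((m - l) ** 2) / (2 * (s * s + t * t + t2)))
        / math.sqrt(s * s + t * t + t2)
        for m, l, s, t in zip(mu, lam, sigma, theta)
    )
    return math.sqrt(max(pp + qq - 2 * pq, 0.0))


def mmd_exponential(lams, mus, sigma):
    """MMD between product exponentials with rates lams and mus
    under k(x, y) = exp(-sigma ||x - y||_1)."""
    pp = math.prod(l / (l + sigma) for l in lams)
    qq = math.prod(u / (u + sigma) for u in mus)
    pq = math.prod(
        l * u * (l + u + 2 * sigma) / ((l + sigma) * (u + sigma) * (l + u))
        for l, u in zip(lams, mus)
    )
    return math.sqrt(max(pp + qq - 2 * pq, 0.0))


def example3_reference(d):
    """Value stated in Example 3: 5^(-d/4) (2 - 2 e^(-d/10))^(1/2)."""
    return 5.0 ** (-d / 4.0) * math.sqrt(2.0 - 2.0 * math.exp(-d / 10.0))


root2 = math.sqrt(2.0)
gauss_d5 = mmd_gaussian([0.0] * 5, [1.0] * 5, [root2] * 5, [root2] * 5, 1.0)
exp_d1 = mmd_exponential([3.0], [1.0], 0.25)
exp_d5 = mmd_exponential([3.0, 2.0, 0.5, 2.0, 7.0], [1.0, 5.0, 2.5, 1.0, 8.0], 0.25)
```

With the parameters of the two examples, these functions return the stated population values to the precision quoted in the text.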

As in the case of $W$, the performance of $\gamma_k(\mathbb{P}_m,\mathbb{Q}_n)$ is verified by drawing $N$ i.i.d. samples (with $m = n = N/2$) from $\mathbb{P}$ and $\mathbb{Q}$ and computing $\gamma_k(\mathbb{P}_m,\mathbb{Q}_n)$ in (25). Figures 2(a-b, a'-b') show the performance of $\gamma_k(\mathbb{P}_m,\mathbb{Q}_n)$ for various sample sizes and some fixed $d$. It is easy to see that the quality of the estimate improves with increasing sample size and that $\gamma_k(\mathbb{P}_m,\mathbb{Q}_n)$ estimates $\gamma_k(\mathbb{P},\mathbb{Q})$ correctly. On the other hand, Figures 2(c, c') demonstrate that $\gamma_k(\mathbb{P}_m,\mathbb{Q}_n)$ is biased at large $d$ and more samples are needed to obtain better estimates. As in the case of $W$, the error bars are obtained by replicating the experiment 20 times.

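3) Estimator of $\beta$: In the case of $W$ and $\gamma_k$, we have some closed form expression to start with (see (36) and (37)), which can be solved by numerical methods. The resulting value is then used as the baseline to test the performance of the estimators of $W$ and $\gamma_k$. On the other hand, in the case of $\beta$, we are not aware of any such closed form expression to compute the baseline. However, it is possible to compute $\beta(\mathbb{P},\mathbb{Q})$ when $\mathbb{P}$ and $\mathbb{Q}$ are discrete distributions on $M$, i.e., $\mathbb{P} = \sum_{i=1}^r \lambda_i\delta_{X_i}$, $\mathbb{Q} = \sum_{i=1}^s \mu_i\delta_{Z_i}$, where $\sum_{i=1}^r \lambda_i = 1$, $\sum_{i=1}^s \mu_i = 1$, $\lambda_i \ge 0$, $\forall i$, $\mu_i \ge 0$, $\forall i$, and $X_i, Z_i \in M$. This is because, for this choice of $\mathbb{P}$ and $\mathbb{Q}$, we have

$$\beta(\mathbb{P},\mathbb{Q}) = \sup\left\{\sum_{i=1}^r \lambda_i f(X_i) - \sum_{i=1}^s \mu_i f(Z_i) : \|f\|_{BL} \le 1\right\} = \sup\left\{\sum_{i=1}^{r+s} \theta_i f(V_i) : \|f\|_{BL} \le 1\right\}, \tag{39}$$

where $\theta = (\lambda_1,\ldots,\lambda_r,-\mu_1,\ldots,-\mu_s)$, $V = (X_1,\ldots,X_r,Z_1,\ldots,Z_s)$ with $\theta_i := (\theta)_i$ and $V_i := (V)_i$. Now, (39) is of the form of (11) and so, by Theorem 6, $\beta(\mathbb{P},\mathbb{Q}) = \sum_{i=1}^{r+s} \theta_i a_i^*$, where $\{a_i^*\}$ solve the following linear program,

$$\max_{a_1,\ldots,a_{r+s},b,c} \sum_{i=1}^{r+s} \theta_i a_i \quad \text{s.t.} \quad -b\,\rho(V_i,V_j) \le a_i - a_j \le b\,\rho(V_i,V_j),\ \forall\, i,j; \quad -c \le a_i \le c,\ \forall\, i; \quad b + c \le 1. \tag{40}$$

Therefore, for these distributions, one can compute the baseline which can then be used to verify the performance of $\beta(\mathbb{P}_m,\mathbb{Q}_n)$. In the following, we consider a simple example to demonstrate the performance of $\beta(\mathbb{P}_m,\mathbb{Q}_n)$.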

Example 5: Let M = {0,1,2,3,4,5} C M, A = 
M - (iiii), X = (0,1,2,3,4) and 
$Z = (2,3,4,5)$. With this choice, $\mathbb{P}$ and $\mathbb{Q}$ are defined as $\mathbb{P} = \sum_{i=1}^5 \lambda_i\delta_{X_i}$ and $\mathbb{Q} = \sum_{i=1}^4 \mu_i\delta_{Z_i}$. By solving (40) with $\rho(x,y) = |x-y|$, we get $\beta(\mathbb{P},\mathbb{Q}) = 0.5278$. Note that the KL-divergence between $\mathbb{P}$ and $\mathbb{Q}$ is $+\infty$.

Figure 3 shows the empirical estimates of $\beta(\mathbb{P},\mathbb{Q})$ (shown in a thick dotted line), which are computed by drawing $N$ i.i.d. samples (with $m = n = N/2$) from $\mathbb{P}$ and $\mathbb{Q}$ and solving the linear program in (21). It can be seen that $\beta(\mathbb{P}_m,\mathbb{Q}_n)$ estimates $\beta(\mathbb{P},\mathbb{Q})$ correctly. ■

Fig. 3. Empirical estimates of the Dudley metric (shown in a thick dotted line) between discrete distributions $\mathbb{P}$ and $\mathbb{Q}$ on $M$ (see Example 5 for details), for increasing sample size $N$. The population value of the Dudley metric is shown in a thin dotted line. Error bars are obtained by replicating the experiment 20 times.



Since we do not know how to compute $\beta(\mathbb{P},\mathbb{Q})$ for $\mathbb{P}$ and $\mathbb{Q}$ other than the ones we discussed here, we do not provide any other non-trivial examples to test the performance of $\beta(\mathbb{P}_m,\mathbb{Q}_n)$.

D. Non-parametric estimation of total variation distance 

So far, the results in Sections III-A to III-C show that IPMs exhibit nice properties compared to those of $\phi$-divergences. As shown in Section II, since the total variation distance,

$$TV(\mathbb{P},\mathbb{Q}) := \sup\{\mathbb{P}f - \mathbb{Q}f : \|f\|_\infty \le 1\}, \tag{41}$$

is both an IPM and a $\phi$-divergence, in this section, we consider its empirical estimation and the consistency analysis. Let $TV(\mathbb{P}_m,\mathbb{Q}_n)$ be an empirical estimator of $TV(\mathbb{P},\mathbb{Q})$. Using similar arguments as in Theorems 5 and 6, it can be shown that

$$TV(\mathbb{P}_m,\mathbb{Q}_n) = \sum_{i=1}^N \widetilde{Y}_i a_i^*, \tag{42}$$

where $\{a_i^*\}_{i=1}^N$ solve the following linear program,

$$\max_{a_1,\ldots,a_N} \sum_{i=1}^N \widetilde{Y}_i a_i \quad \text{s.t.} \quad -1 \le a_i \le 1,\ \forall\, i. \tag{43}$$
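Because the box constraints in this linear program decouple across the sample points, its solution can be written down without a solver: each $a_i^*$ is the sign of the signed weight carried by the corresponding point. The sketch below assumes weights $+1/m$ and $-1/n$ for the two samples and merges coinciding points before taking signs; the function name `tv_plugin` and the toy samples are illustrative.

```python
from collections import defaultdict


def tv_plugin(xs, zs):
    """Plug-in total variation estimate from two samples.

    Each distinct point carries weight (#occurrences in xs)/m - (#occurrences in zs)/n,
    and the optimum of the box-constrained linear program is the sum of the
    absolute values of these weights.
    """
    weights = defaultdict(float)
    for x in xs:
        weights[x] += 1.0 / len(xs)
    for z in zs:
        weights[z] -= 1.0 / len(zs)
    return sum(abs(w) for w in weights.values())


disjoint = tv_plugin([0.1, 0.2, 0.3], [0.15, 0.25])  # no shared points
identical = tv_plugin([1, 2], [1, 2])                # same empirical measure
overlap = tv_plugin([0, 0, 1], [0, 1, 1])            # partially shared atoms
```

Whenever the two samples share no points, which happens almost surely for continuous distributions, the estimate equals 2 regardless of how close the underlying distributions are.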



Now, the question is whether this estimator is consistent. To answer this question, we consider an equivalent representation of $TV$ given as

$$TV(\mathbb{P},\mathbb{Q}) = 2\sup_A |\mathbb{P}(A) - \mathbb{Q}(A)|, \tag{44}$$

where the supremum is taken over all measurable subsets $A$ of $M$ [48]. Note that $|TV(\mathbb{P}_m,\mathbb{Q}_n) - TV(\mathbb{P},\mathbb{Q})| \le TV(\mathbb{P}_m,\mathbb{P}) + TV(\mathbb{Q}_n,\mathbb{Q})$. It is easy to see that $TV(\mathbb{P}_m,\mathbb{P})$ does not converge to $0$ as $m \to \infty$ for all $\mathbb{P}$ and therefore, the estimator in (42) is not strongly consistent. This is because if $\mathbb{P}$ is absolutely continuous, then $TV(\mathbb{P}_m,\mathbb{P}) = 2$, where we have considered the set $A$ that is the finite support of $\mathbb{P}_m$ such that $\mathbb{P}_m(A) = 1$ and $\mathbb{P}(A) = 0$. In fact, Devroye and Györfi [48] have proved that for any empirical measure, $\widehat{\mathbb{P}}_m$ (a function depending on $\{X_i^{(1)}\}_{i=1}^m$ assigning a nonnegative number to any measurable set), there exists a distribution, $\mathbb{P}$, such that for all $m$,

$$\sup_A |\widehat{\mathbb{P}}_m(A) - \mathbb{P}(A)| > \frac{1}{4} \quad \text{a.s.} \tag{45}$$

This indicates that, for the strong consistency of distribution estimates in total variation, the set of probability measures has to be restricted. Barron et al. [55] have studied the classes of distributions that can be estimated consistently in total variation. Therefore, for such distributions, the total variation distance between them can be estimated by an estimator that is strongly consistent.

The issue in the estimation of $TV(\mathbb{P},\mathbb{Q})$ is that the set $\mathcal{F}_{TV} := \{f : \|f\|_\infty \le 1\}$ is too large to obtain meaningful results if no assumptions on distributions are made. On the other hand, one can choose a more manageable subset $\mathcal{F}$ of $\mathcal{F}_{TV}$ such that $\gamma_{\mathcal{F}}(\mathbb{P},\mathbb{Q}) \le TV(\mathbb{P},\mathbb{Q})$, $\forall\, \mathbb{P},\mathbb{Q} \in \mathscr{P}$ and $\gamma_{\mathcal{F}}(\mathbb{P}_m,\mathbb{Q}_n)$ is a consistent estimator of $\gamma_{\mathcal{F}}(\mathbb{P},\mathbb{Q})$. Examples of such a choice of $\mathcal{F}$ include $\mathcal{F}_\beta$ and $\{\mathbb{1}_{(-\infty,t]} : t \in \mathbb{R}^d\}$, where the former yields the Dudley metric while the latter results in the Kolmogorov distance. The empirical estimator of the Dudley metric and its consistency have been presented in Sections III-A and III-B. The empirical estimator of the Kolmogorov distance between $\mathbb{P}$ and $\mathbb{Q}$ is well studied and is strongly consistent, which simply follows from the famous Glivenko-Cantelli theorem [43, Theorem 12.4].

Since the total variation distance between $\mathbb{P}$ and $\mathbb{Q}$ cannot be estimated consistently for all $\mathbb{P},\mathbb{Q} \in \mathscr{P}$, in the following, we present two lower bounds on $TV$, one involving $W$ and $\beta$ and the other involving $\gamma_k$, which can be estimated consistently.

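Theorem 14 (Lower bounds on $TV$): (i) For all $\mathbb{P} \ne \mathbb{Q}$, $\mathbb{P},\mathbb{Q} \in \mathscr{P}$, we have

$$TV(\mathbb{P},\mathbb{Q}) \ge \frac{W(\mathbb{P},\mathbb{Q})\,\beta(\mathbb{P},\mathbb{Q})}{W(\mathbb{P},\mathbb{Q}) - \beta(\mathbb{P},\mathbb{Q})}. \tag{46}$$

(ii) Suppose $C := \sup_{x \in M} k(x,x) < \infty$. Then

$$TV(\mathbb{P},\mathbb{Q}) \ge \frac{\gamma_k(\mathbb{P},\mathbb{Q})}{\sqrt{C}}. \tag{47}$$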

Before we prove Theorem 14, we present a simple lemma.

Lemma 15: Let $\theta : V \to \mathbb{R}$ and $\psi : V \to \mathbb{R}$ be convex functions on a real vector space $V$. Suppose

$$a = \sup\{\theta(x) : \psi(x) \le b\}, \tag{48}$$

where $\theta$ is not constant on $\{x : \psi(x) \le b\}$ and $a < \infty$. Then

$$b = \inf\{\psi(x) : \theta(x) \ge a\}. \tag{49}$$

Proof: See Appendix D. ■

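Proof of Theorem 14: (i) Note that $\|f\|_L$, $\|f\|_{BL}$ and $\|f\|_\infty$ are convex functionals on the vector spaces $\text{Lip}(M,\rho)$, $BL(M,\rho)$ and $U(M) := \{f : M \to \mathbb{R} \mid \|f\|_\infty < \infty\}$ respectively. Similarly, $\mathbb{P}f - \mathbb{Q}f$ is a convex functional on $\text{Lip}(M,\rho)$, $BL(M,\rho)$ and $U(M)$. Since $\mathbb{P} \ne \mathbb{Q}$, $\mathbb{P}f - \mathbb{Q}f$ is not constant on $\mathcal{F}_W$, $\mathcal{F}_\beta$ and $\mathcal{F}_{TV}$. Therefore, by appropriately choosing $\psi$, $\theta$, $V$ and $b$ in Lemma 15, the following sequence of inequalities is obtained. Define $\beta := \beta(\mathbb{P},\mathbb{Q})$, $W := W(\mathbb{P},\mathbb{Q})$, $TV := TV(\mathbb{P},\mathbb{Q})$.

$$\begin{aligned}
1 &= \inf\{\|f\|_{BL} : \mathbb{P}f - \mathbb{Q}f \ge \beta,\ f \in BL(M,\rho)\} \\
&\ge \inf\{\|f\|_L : \mathbb{P}f - \mathbb{Q}f \ge \beta,\ f \in BL(M,\rho)\} + \inf\{\|f\|_\infty : \mathbb{P}f - \mathbb{Q}f \ge \beta,\ f \in BL(M,\rho)\} \\
&= \frac{\beta}{W}\inf\{\|f\|_L : \mathbb{P}f - \mathbb{Q}f \ge W,\ f \in BL(M,\rho)\} + \frac{\beta}{TV}\inf\{\|f\|_\infty : \mathbb{P}f - \mathbb{Q}f \ge TV,\ f \in BL(M,\rho)\} \\
&\ge \frac{\beta}{W}\inf\{\|f\|_L : \mathbb{P}f - \mathbb{Q}f \ge W,\ f \in \text{Lip}(M,\rho)\} + \frac{\beta}{TV}\inf\{\|f\|_\infty : \mathbb{P}f - \mathbb{Q}f \ge TV,\ f \in U(M)\} \\
&= \frac{\beta}{W} + \frac{\beta}{TV},
\end{aligned}$$

which gives (46).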
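For completeness, the rearrangement from the last line of the chain, $1 \ge \beta/W + \beta/TV$, to the bound in (46) can be spelled out as follows. This is only a worked restatement of that final step; it uses $\beta > 0$ and $TV > 0$, which hold because $\mathbb{P} \ne \mathbb{Q}$.

```latex
\begin{align*}
1 \ge \frac{\beta}{W} + \frac{\beta}{TV}
  &\;\Longrightarrow\; \frac{\beta}{TV} \le 1 - \frac{\beta}{W} = \frac{W - \beta}{W}
  && \text{(so } W - \beta > 0 \text{, since } \beta/TV > 0\text{)} \\
  &\;\Longrightarrow\; TV \ge \frac{W\beta}{W - \beta}
  && \text{(multiply by } \tfrac{W \cdot TV}{W - \beta} > 0\text{)}.
\end{align*}
```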

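(ii) To prove (47), we use the coupling formulation for $TV$ [56, p. 19] given by

$$TV(\mathbb{P},\mathbb{Q}) = 2\inf_{\lambda \in \mathcal{L}(\mathbb{P},\mathbb{Q})} \lambda(X \ne Y), \tag{50}$$

where $\mathcal{L}(\mathbb{P},\mathbb{Q})$ is the set of all measures on $M \times M$ with marginals $\mathbb{P}$ and $\mathbb{Q}$. Here, $X$ and $Y$ are distributed as $\mathbb{P}$ and $\mathbb{Q}$ respectively. Let $\lambda \in \mathcal{L}(\mathbb{P},\mathbb{Q})$ and $f \in \mathscr{H}$. Then

$$\begin{aligned}
\left|\int_M f\, d(\mathbb{P} - \mathbb{Q})\right| &= \left|\int (f(x) - f(y))\, d\lambda(x,y)\right| \le \int |f(x) - f(y)|\, d\lambda(x,y) \\
&\overset{(a)}{=} \int \left|\langle f, k(\cdot,x) - k(\cdot,y)\rangle_{\mathscr{H}}\right| d\lambda(x,y) \overset{(b)}{\le} \|f\|_{\mathscr{H}}\int \|k(\cdot,x) - k(\cdot,y)\|_{\mathscr{H}}\, d\lambda(x,y),
\end{aligned}$$

where we have used the reproducing property of $\mathscr{H}$ in (a) and the Cauchy-Schwarz inequality in (b). Taking the supremum over $f \in \mathcal{F}_k$ and the infimum over $\lambda \in \mathcal{L}(\mathbb{P},\mathbb{Q})$ gives

$$\gamma_k(\mathbb{P},\mathbb{Q}) \le \inf_{\lambda \in \mathcal{L}(\mathbb{P},\mathbb{Q})}\int \|k(\cdot,x) - k(\cdot,y)\|_{\mathscr{H}}\, d\lambda(x,y). \tag{51}$$

Consider

$$\|k(\cdot,x) - k(\cdot,y)\|_{\mathscr{H}} \le \mathbb{1}_{x \ne y}\|k(\cdot,x) - k(\cdot,y)\|_{\mathscr{H}} \le \mathbb{1}_{x \ne y}\left[\|k(\cdot,x)\|_{\mathscr{H}} + \|k(\cdot,y)\|_{\mathscr{H}}\right] = \mathbb{1}_{x \ne y}\left[\sqrt{k(x,x)} + \sqrt{k(y,y)}\right] \le 2\sqrt{C}\,\mathbb{1}_{x \ne y}. \tag{52}$$

Using (52) in (51) together with (50) yields (47). ■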
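The bound $TV \ge \gamma_k/\sqrt{C}$ is easy to verify numerically for discrete distributions, where both quantities are available exactly. The sketch below assumes a Gaussian kernel (so $C = 1$) and the $L^1$ form of the total variation distance, $\sum_i |p_i - q_i|$, which matches the convention $TV = 2\sup_A|\mathbb{P}(A) - \mathbb{Q}(A)|$; the function names and the two probability vectors are illustrative.

```python
import math


def gaussian_kernel(x, y, tau=1.0):
    return math.exp(-((x - y) ** 2) / (2.0 * tau ** 2))


def tv_discrete(p, q):
    """Total variation in the convention TV = 2 sup_A |P(A) - Q(A)|."""
    return sum(abs(a - b) for a, b in zip(p, q))


def mmd_discrete(points, p, q, kernel):
    """Exact MMD between two distributions supported on the same finite set."""
    w = [a - b for a, b in zip(p, q)]
    total = sum(
        w[i] * w[j] * kernel(points[i], points[j])
        for i in range(len(points))
        for j in range(len(points))
    )
    return math.sqrt(max(total, 0.0))


points = [0.0, 1.0, 2.0, 3.0]
p = [0.4, 0.3, 0.2, 0.1]
q = [0.1, 0.2, 0.3, 0.4]
tv = tv_discrete(p, q)
mmd = mmd_discrete(points, p, q, gaussian_kernel)
sqrt_c = 1.0  # sup_x k(x, x) = 1 for the Gaussian kernel
```

Here `mmd / sqrt_c` stays below `tv`, as the theorem requires, and the gap shrinks as the support points move far apart relative to the kernel bandwidth.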



Remark 16: (i) As mentioned before, a simple lower bound on $TV$ can be obtained as $TV(\mathbb{P},\mathbb{Q}) \ge \beta(\mathbb{P},\mathbb{Q})$, $\forall\, \mathbb{P},\mathbb{Q} \in \mathscr{P}$. It is easy to see that the bound in (46) is tighter as $\frac{W(\mathbb{P},\mathbb{Q})\,\beta(\mathbb{P},\mathbb{Q})}{W(\mathbb{P},\mathbb{Q}) - \beta(\mathbb{P},\mathbb{Q})} \ge \beta(\mathbb{P},\mathbb{Q})$, with equality only if $\mathbb{P} = \mathbb{Q}$.

(ii) From (46), it is easy to see that $TV(\mathbb{P},\mathbb{Q}) = 0$ or $W(\mathbb{P},\mathbb{Q}) = 0$ implies $\beta(\mathbb{P},\mathbb{Q}) = 0$ while the converse is not true. This shows that the topology induced by $\beta$ on $\mathscr{P}$ is coarser than the topology induced by either $W$ or $TV$.

(iii) The bounds in (46) and (47) translate as lower bounds on the KL-divergence through Pinsker's inequality: $TV^2(\mathbb{P},\mathbb{Q}) \le 2\,KL(\mathbb{P},\mathbb{Q})$, $\forall\, \mathbb{P},\mathbb{Q} \in \mathscr{P}$. See Fedotov et al. [36] and references therein for more refined bounds between $TV$ and $KL$. Therefore, using these bounds, one can obtain a consistent estimate of a lower bound on $TV$ and $KL$. The bounds in (46) and (47) also translate to lower bounds on other distance measures on $\mathscr{P}$. See [57] for a detailed discussion on the relation between various metrics.

To summarize, in this section, we have considered the em- 
pirical estimation of IPMs along with their convergence rate 
analysis. We have shown that IPMs such as the Wasserstein 
distance, Dudley metric and MMD are simpler to estimate 
than the KL-divergence. This is because the Wasserstein 
distance and Dudley metric are estimated by solving a linear 
program while estimating the KL-divergence involves solving 
a quadratic program [35]. Even more, the estimator of MMD 
has a simple closed form expression. On the other hand, 
space partitioning schemes like in [17], to estimate the KL- 
divergence, become increasingly difficult to implement as the 
number of dimensions increases whereas an increased number 
of dimensions has only a mild effect on the complexity 
of estimating $W$, $\beta$ and $\gamma_k$. In addition, the estimators of
IPMs, especially the Wasserstein distance, Dudley metric and 
MMD, exhibit good convergence behavior compared to KL- 
divergence estimators as the latter can have an arbitrarily slow 
rate of convergence depending on the probability distributions 
[17], [35]. With these advantages, we believe that IPMs can 
find applications in information theory, detection theory, image 
processing, machine learning, neuroscience and other areas. As 
an example, in the following section, we show how IPMs are 
related to binary classification. 



IV. Interpretability of IPMs: Relation to Binary Classification

In this section, we provide different interpretations of IPMs by relating them to the problem of binary classification. First, in Section IV-A, we provide a novel interpretation for $\beta$, $W$, $TV$ and $\gamma_k$ (see Theorem 17) as the optimal risk associated with an appropriate binary classification problem. Second, in Section IV-B, we relate $W$ and $\beta$ to the margin of the Lipschitz classifier [41] and the bounded Lipschitz classifier respectively. The significance of this result is that the smoothness of Lipschitz and bounded Lipschitz classifiers is inversely related to the distance between the class-conditional distributions, computed using $W$ and $\beta$ respectively. Third, in Section IV-C, we discuss the relation between $\gamma_k$ and the Parzen window classifier [37], [42] (also called the kernel classification rule [43, Chapter 10]).
A. Interpretation of $\beta$, $W$, $TV$ and $\gamma_k$ as the optimal risk of a binary classification problem

Let us consider the binary classification problem with $X$ being an $M$-valued random variable, $Y$ being a $\{-1,+1\}$-valued random variable and the product space, $M \times \{-1,+1\}$, being endowed with a Borel probability measure $\mu$. A discriminant function, $f$, is a real-valued measurable function on $M$, whose sign is used to make a classification decision (we write $\mathcal{F}_\star$ for the set of all such functions). Given a loss function, $L : \{-1,+1\} \times \mathbb{R} \to \mathbb{R}$, the goal is to choose an $f$ that minimizes the risk associated with $L$, with the optimal $L$-risk being defined as

$$R^L_{\mathcal{F}_\star} = \inf_{f \in \mathcal{F}_\star}\int_M L(y,f(x))\, d\mu(x,y) = \inf_{f \in \mathcal{F}_\star}\left\{\varepsilon\int_M L_1(f(x))\, d\mathbb{P}(x) + (1-\varepsilon)\int_M L_{-1}(f(x))\, d\mathbb{Q}(x)\right\}, \tag{53}$$

where $L_1(\alpha) := L(1,\alpha)$, $L_{-1}(\alpha) := L(-1,\alpha)$, $\mathbb{P}(X) := \mu(X|Y = +1)$, $\mathbb{Q}(X) := \mu(X|Y = -1)$, $\varepsilon := \mu(M, Y = +1)$. Here, $\mathbb{P}$ and $\mathbb{Q}$ represent the class-conditional distributions and $\varepsilon$ is the prior probability of class $+1$.

By appropriately choosing $L$, Nguyen et al. [33] have shown
an equivalence between $\phi$-divergences (between $P$ and $Q$) and
$R^L_{\mathcal{F}_\star}$. In particular, they showed that for each loss function,
$L$, there exists exactly one corresponding $\phi$-divergence such
that $R^L_{\mathcal{F}_\star} = -D_\phi(P, Q)$. For example, the total variation
distance, Hellinger distance and $\chi^2$-divergence are shown to
be related to the optimal $L$-risk where $L$ is the hinge loss
($L(y, \alpha) = \max(0, 1 - y\alpha)$), exponential loss ($L(y, \alpha) =
\exp(-y\alpha)$) and logistic loss ($L(y, \alpha) = \log(1 + \exp(-y\alpha))$),
respectively. In statistical machine learning, these losses are
well-studied and are shown to result in various binary
classification algorithms like support vector machines, AdaBoost and
logistic regression. See [37], [58] for details.

Similarly, by appropriately choosing L, we present and 
prove the following result that relates IPMs (between the class- 
conditional distributions) and the optimal L-risk of a binary 
classification problem. 

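Theorem 17 ($\gamma_\mathcal{F}$ and associated $L$): Let $L_1(\alpha) = -\frac{\alpha}{\varepsilon}$ and
$L_{-1}(\alpha) = \frac{\alpha}{1 - \varepsilon}$. Let $\mathcal{F} \subset \mathcal{F}_\star$ be such that
$f \in \mathcal{F} \Rightarrow -f \in \mathcal{F}$. Then, $\gamma_\mathcal{F}(P, Q) = -R^L_\mathcal{F}$.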



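Proof: From (53), we have

$$\varepsilon \int_M L_1(f)\, dP + (1 - \varepsilon) \int_M L_{-1}(f)\, dQ = \int_M f\, dQ - \int_M f\, dP = Qf - Pf. \tag{54}$$

Therefore,

$$R^L_\mathcal{F} = \inf_{f \in \mathcal{F}} (Qf - Pf) = -\sup_{f \in \mathcal{F}} (Pf - Qf) \stackrel{(a)}{=} -\sup_{f \in \mathcal{F}} |Pf - Qf| = -\gamma_\mathcal{F}(P, Q), \tag{55}$$

where (a) follows from the fact that $\mathcal{F}$ is symmetric around
zero, i.e., $f \in \mathcal{F} \Rightarrow -f \in \mathcal{F}$. ■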



Theorem 17 shows that $\gamma_\mathcal{F}(P, Q)$ is the negative of the optimal
$L$-risk that is associated with a binary classifier that classifies
the class-conditional distributions $P$ and $Q$ using the loss
function, $L$, in Theorem 17 when the discriminant function
is restricted to $\mathcal{F}$. Therefore, Theorem 17 provides a novel
interpretation for the total variation distance, Dudley metric,
Wasserstein distance and MMD, which can be understood as
the negative of the optimal $L$-risk associated with binary classifiers
where the discriminant function, $f$, is restricted to
$\mathcal{F}_{TV}$, $\mathcal{F}_\beta$, $\mathcal{F}_W$ and $\mathcal{F}_k$, respectively.
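
The identity behind Theorem 17 can be sanity-checked numerically. The following is a minimal Python sketch and not part of the original analysis: it replaces $P$ and $Q$ by two small hypothetical samples, takes $\varepsilon$ to be the empirical fraction of class $+1$, and uses a small hand-picked function class that is symmetric around zero. All names and data are assumptions made for illustration.

```python
# Toy check of Theorem 17 on empirical measures (illustrative names and data).

def l_risk(f, pos, neg, eps):
    """L-risk as in (53) with L_1(a) = -a/eps and L_{-1}(a) = a/(1 - eps)."""
    p_term = sum(-f(x) / eps for x in pos) / len(pos)         # integral of L_1(f) dP
    q_term = sum(f(x) / (1.0 - eps) for x in neg) / len(neg)  # integral of L_{-1}(f) dQ
    return eps * p_term + (1.0 - eps) * q_term


def mean(f, xs):
    """Average of f over a sample, i.e. P_m f or Q_n f."""
    return sum(f(x) for x in xs) / len(xs)


def negate(g):
    """Return the function -g."""
    return lambda x: -g(x)


pos = [0.9, 1.4, 2.0]          # hypothetical sample standing in for P (class +1)
neg = [-1.1, -0.3, 0.2, 0.5]   # hypothetical sample standing in for Q (class -1)
eps = len(pos) / (len(pos) + len(neg))

# A small class that is symmetric around zero: f in F implies -f in F.
base = [lambda x: x, lambda x: min(1.0, max(-1.0, x)), lambda x: 0.5 * x * x]
F = base + [negate(g) for g in base]

risks = [l_risk(f, pos, neg, eps) for f in F]
gaps = [abs(mean(f, pos) - mean(f, neg)) for f in F]

optimal_risk = min(risks)   # inf over F of (Qf - Pf)
ipm = max(gaps)             # sup over F of |Pf - Qf|
```

For every $f$ in this toy class the risk equals $Qf - Pf$, so the minimum risk and the maximum gap differ only in sign, which is the statement of the theorem restricted to the chosen class.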

Suppose we are given a finite number of samples
$\{(X_i, Y_i)\}_{i=1}^N$, $X_i \in M$, $Y_i \in \{-1,+1\}$, $\forall i$, drawn i.i.d.
from $\mu$ and we would like to build a classifier, $f \in \mathcal{F}$, that
minimizes the expected loss (with $L$ as in Theorem 17) based
on this finite number of samples. This is usually carried out by
solving an empirical equivalent of (53), which reduces to the
empirical estimator of the IPM, i.e.,
$\gamma_\mathcal{F}(P_m, Q_n) = \sup\{|\sum_{i=1}^N \widetilde{Y}_i f(X_i)| : f \in \mathcal{F}\}$, by noting
that $X^{(1)}_i := X_i$ when $Y_i = 1$, $X^{(2)}_i := X_i$ when $Y_i = -1$,
and $f \in \mathcal{F} \Rightarrow -f \in \mathcal{F}$. This means the sign of the $f \in \mathcal{F}$ that
solves this empirical problem is the classifier we are looking for.



B. Wasserstein distance and Dudley metric: Relation to Lipschitz
and bounded Lipschitz classifiers

The Lipschitz classifier is defined as the solution, $f_{\mathrm{lip}}$, to the
following program:

$$\inf_{f \in \mathrm{Lip}(M, \rho)} \|f\|_L \quad \text{s.t.} \quad Y_i f(X_i) \geq 1,\ i = 1, \ldots, N, \tag{56}$$

which is a large margin classifier with margin* $\frac{1}{\|f_{\mathrm{lip}}\|_L}$. The
program in (56) computes a smooth function, $f$, that classifies
the training sequence, $\{(X_i, Y_i)\}_{i=1}^N$, correctly (note that the
constraints in (56) are such that $\mathrm{sign}(f(X_i)) = Y_i$, which
means $f$ classifies the training sequence correctly, assuming
the training sequence is separable). The smoothness is
controlled by $\|f\|_L$ (the smaller the value of $\|f\|_L$, the smoother $f$
and vice-versa). See [41] for a detailed study on the Lipschitz
classifier. Replacing $\|f\|_L$ by $\|f\|_{BL}$ in (56) gives the bounded
Lipschitz classifier, $f_{BL}$, which is the solution to the following
program:

$$\inf_{f \in BL(M, \rho)} \|f\|_{BL} \quad \text{s.t.} \quad Y_i f(X_i) \geq 1,\ i = 1, \ldots, N. \tag{57}$$

Note that replacing $\|f\|_L$ by $\|f\|_\mathcal{H}$ in (56), taking the infimum
over $f \in \mathcal{H}$, yields the hard-margin support vector machine
(SVM) [59]. We now show how the empirical estimates of $W$
and $\beta$ appear as upper bounds on the margins of the Lipschitz
and bounded Lipschitz classifiers, respectively.

Theorem 18: The Wasserstein distance and Dudley metric
are related to the margins of Lipschitz and bounded Lipschitz
classifiers as

$$\frac{1}{\|f_{\mathrm{lip}}\|_L} \leq \frac{W(P_m, Q_n)}{2}, \tag{58}$$

$$\frac{1}{\|f_{BL}\|_{BL}} \leq \frac{\beta(P_m, Q_n)}{2}. \tag{59}$$



*The margin is a technical term used in statistical machine learning. See 
[37] for details. 






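Proof: Define $W_{mn} := W(P_m, Q_n)$. By Lemma 15, we have

$$1 = \inf\Big\{\|f\|_L : \sum_{i=1}^N \widetilde{Y}_i f(X_i) \geq W_{mn},\ f \in \mathrm{Lip}(M, \rho)\Big\},$$

which can be written as

$$\frac{2}{W_{mn}} = \inf\Big\{\|f\|_L : \sum_{i=1}^N \widetilde{Y}_i f(X_i) \geq 2,\ f \in \mathrm{Lip}(M, \rho)\Big\}.$$

Note that $\{f \in \mathrm{Lip}(M, \rho) : Y_i f(X_i) \geq 1,\ \forall i\} \subset \{f \in
\mathrm{Lip}(M, \rho) : \sum_{i=1}^N \widetilde{Y}_i f(X_i) \geq 2\}$, and therefore

$$\frac{2}{W_{mn}} \leq \inf\big\{\|f\|_L : Y_i f(X_i) \geq 1,\ \forall i,\ f \in \mathrm{Lip}(M, \rho)\big\},$$

hence proving (58). Similar analysis for $\beta$ yields (59). ■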

The significance of this result is as follows. (58) shows that
$\|f_{\mathrm{lip}}\|_L \geq \frac{2}{W(P_m, Q_n)}$, which means the smoothness of the
classifier, $f_{\mathrm{lip}}$, computed as $\|f_{\mathrm{lip}}\|_L$ is bounded by the inverse
of the Wasserstein distance between $P_m$ and $Q_n$. So, if the
distance between the class-conditionals $P$ and $Q$ is "small"
(in terms of $W$), then the resulting Lipschitz classifier is less
smooth, i.e., a "complex" classifier is required to classify the
distributions $P$ and $Q$. A similar explanation holds for the
bounded Lipschitz classifier.
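
The bound in (58) can be illustrated on the real line. The following Python sketch is only an illustration under stated assumptions: $M = \mathbb{R}$ with $\rho(x, y) = |x - y|$, two small hypothetical separable samples, and the observation (which follows from the constraints in (56) together with the Lipschitz extension in Lemma 19) that on a separable sample $\|f_{\mathrm{lip}}\|_L$ equals $2$ divided by the smallest distance between points of opposite classes. The function names are hypothetical.

```python
def wasserstein_1d(xs, ys):
    """W(P_m, Q_n) on the real line, computed as the integral of |F_m - G_n|."""
    points = sorted(set(xs) | set(ys))
    total = 0.0
    for left, right in zip(points, points[1:]):
        # Both empirical CDFs are constant on [left, right).
        f_m = sum(x <= left for x in xs) / len(xs)
        g_n = sum(y <= left for y in ys) / len(ys)
        total += abs(f_m - g_n) * (right - left)
    return total


def lipschitz_classifier_norm(pos, neg):
    """Smallest Lipschitz constant of any f with f >= 1 on pos and f <= -1 on neg."""
    d_min = min(abs(x - y) for x in pos for y in neg)
    return 2.0 / d_min


pos = [1.0, 1.5, 3.0]   # hypothetical sample with label +1
neg = [-2.0, -0.5]      # hypothetical sample with label -1

margin = 1.0 / lipschitz_classifier_norm(pos, neg)   # left-hand side of (58)
w_mn = wasserstein_1d(pos, neg)                      # W(P_m, Q_n)
```

In this toy case the margin is half the smallest inter-class distance, which cannot exceed half the average transport cost $W(P_m, Q_n)/2$, in agreement with (58).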

C. Maximum mean discrepancy: Relation to Parzen window 
classifier and support vector machine 

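Consider the maximizer $f$, for the empirical estimator of
MMD, in (24). Computing $y = \mathrm{sign}(f(x))$ gives

$$y = \begin{cases} +1, & \frac{1}{m} \sum_{Y_i = 1} k(x, X_i) > \frac{1}{n} \sum_{Y_i = -1} k(x, X_i) \\ -1, & \frac{1}{m} \sum_{Y_i = 1} k(x, X_i) \leq \frac{1}{n} \sum_{Y_i = -1} k(x, X_i), \end{cases} \tag{60}$$

which is exactly the classification function of a Parzen window
classifier [37], [42]. It is easy to see that (60) can be rewritten
as

$$y = \mathrm{sign}(\langle w, k(\cdot, x) \rangle_\mathcal{H}), \tag{61}$$

where $w = \mu^+ - \mu^-$, $\mu^+ := \frac{1}{m} \sum_{Y_i = 1} k(\cdot, X_i)$ and
$\mu^- := \frac{1}{n} \sum_{Y_i = -1} k(\cdot, X_i)$. $\mu^+$ and $\mu^-$ represent the class means
associated with $X^+ := \{X_i : Y_i = 1\}$ and $X^- := \{X_i :
Y_i = -1\}$, respectively.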

The Parzen window classification rule in (61) can be
interpreted as a mean classifier in $\mathcal{H}$: $\langle w, k(\cdot, x) \rangle_\mathcal{H}$ represents a
hyperplane in $\mathcal{H}$ passing through the origin with $w$ being its
normal along the direction that joins the means, $\mu^+$ and $\mu^-$,
in $\mathcal{H}$. From the empirical estimator of MMD, we can see that
$\gamma_k(P_m, Q_n)$ is the RKHS distance between the mean functions,
$\mu^+$ and $\mu^-$.

Suppose $\|\mu^+\|_\mathcal{H} = \|\mu^-\|_\mathcal{H}$, i.e., $\mu^+$ and $\mu^-$ are equidistant
from the origin in $\mathcal{H}$. Then, the rule in (61) can be equivalently
written as

$$y = \mathrm{sign}\left(\|k(\cdot, x) - \mu^-\|_\mathcal{H}^2 - \|k(\cdot, x) - \mu^+\|_\mathcal{H}^2\right). \tag{62}$$

(62) provides another interpretation of the rule in (60), i.e., as
a nearest-neighbor rule: assign to $x$ the label associated with
the mean $\mu^+$ or $\mu^-$, depending on which mean function is
closest to $k(\cdot, x)$ in $\mathcal{H}$.
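
The rule in (60) and the RKHS distance between the class means are straightforward to compute from kernel evaluations. The following is a minimal Python sketch under stated assumptions: scalar inputs, a Gaussian kernel with a hypothetical bandwidth `sigma` (the text does not fix a kernel), and illustrative function names. The squared distance $\|\mu^+ - \mu^-\|_\mathcal{H}^2$ is expanded into three averages of kernel values.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel on the real line (an assumed choice of k)."""
    return math.exp(-((x - y) ** 2) / (2.0 * sigma ** 2))


def parzen_label(x, pos, neg, k=gaussian_kernel):
    """Rule (60): compare the class-wise kernel averages at x."""
    score_pos = sum(k(x, xi) for xi in pos) / len(pos)
    score_neg = sum(k(x, xi) for xi in neg) / len(neg)
    return 1 if score_pos > score_neg else -1


def empirical_mmd(pos, neg, k=gaussian_kernel):
    """RKHS distance between the class means mu^+ and mu^-."""
    m, n = len(pos), len(neg)
    pp = sum(k(a, b) for a in pos for b in pos) / (m * m)   # <mu^+, mu^+>
    nn = sum(k(a, b) for a in neg for b in neg) / (n * n)   # <mu^-, mu^->
    pn = sum(k(a, b) for a in pos for b in neg) / (m * n)   # <mu^+, mu^->
    return math.sqrt(max(pp + nn - 2.0 * pn, 0.0))


pos = [1.5, 2.0, 2.5]      # hypothetical sample with label +1
neg = [-2.5, -2.0, -1.5]   # hypothetical sample with label -1

label_right = parzen_label(2.0, pos, neg)
label_left = parzen_label(-2.0, pos, neg)
mmd = empirical_mmd(pos, neg)
```

Points near the positive sample receive label $+1$ and points near the negative sample receive label $-1$, while the distance between the class means is zero when the two samples coincide.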



The classification rule in (60) differs from the "classical"
Parzen window classifier in two respects. (i) Usually, the
kernel (called the smoothing kernel) in the Parzen window
rule is translation invariant in $\mathbb{R}^d$. In our case, $M$ need not
be $\mathbb{R}^d$ and $k$ need not be translation invariant. So, the rule in
(60) can be seen as a generalization of the classical Parzen
window rule. (ii) The kernel in (60) is positive definite, unlike
in the classical Parzen window rule where $k$ need not be so.

Recently, Reid and Williamson [40, Section 8, Appendix
E] have related MMD to Fisher discriminant analysis [43,
Section 4.3] in $\mathcal{H}$ and SVM [59]. Our approach to relate
MMD to SVM is along the lines of Theorem 18, where it is
easy to see that the margin of an SVM, computed as $\frac{1}{\|f_{\mathrm{svm}}\|_\mathcal{H}}$,
can be upper bounded by $\frac{\gamma_k(P_m, Q_n)}{2}$, which says that the
smoothness of an SVM classifier is bounded by the inverse
of the MMD between $P$ and $Q$.

To summarize, in this section, we have provided an intuitive 
understanding of IPMs by relating them to the binary clas- 
sification problem. We showed that IPMs can be interpreted 
either in terms of the risk associated with an appropriate binary 
classifier or in terms of the smoothness of the classifier.

V. Conclusion & Discussion 

In this work, we presented integral probability metrics 
(IPMs) from a more practical perspective. We first proved that 
IPMs and $\phi$-divergences are essentially different: indeed, the
total variation distance is the only "non-trivial" $\phi$-divergence
that is also an IPM. We then demonstrated consistency and 
convergence rates of the empirical estimators of IPMs, and 
showed that the empirical estimators of the Wasserstein dis- 
tance, Dudley metric, and maximum mean discrepancy are 
strongly consistent and have a good convergence behavior. 
In addition, we showed these estimators to be very easy to 
compute, unlike for $\phi$-divergences. Finally, we found that
IPMs naturally appear in a binary classification setting, first 
by relating them to the optimal L-risk of a binary classifier; 
and second, by relating the Wasserstein distance to the margin 
of a Lipschitz classifier, the Dudley metric to the margin of a 
bounded Lipschitz classifier, and the maximum mean discrepancy
to the Parzen window classifier. With many IPMs having
been used only as theoretical tools, we believe that this study 
highlights properties of IPMs that have not been explored 
before and would improve their practical applicability. 

There are several interesting problems yet to be explored in 
connection with this work. The minimax rate for estimating 
$W$, $\beta$ and $\gamma_k$ has not been established, nor is it known
whether the proposed estimators achieve this rate. It may also 
be possible to relate IPMs and Bregman divergences. On the 
most basic level, these two families do not intersect: Bregman 
divergences do not satisfy the triangle inequality, whereas 
IPMs do (they are pseudometrics on $\mathscr{P}$). Recently, however,
Chen et al. [60], [61] have studied "square-root metrics" based 
on Bregman divergences. One could investigate conditions on 
$\mathcal{F}$ for which $\gamma_\mathcal{F}$ coincides with such a family.

Similarly, in the case of $\phi$-divergences, some functions of
$D_\phi$ are shown to be metrics on $\mathscr{P}_\lambda$ (see Theorem 2 for
the notation), for example, the square root of the variational
distance, the square root of Hellinger's distance, the square
root of the Jensen-Shannon divergence [62]-[64], etc. Also,
Österreicher and Vajda [65, Theorem 1] have shown that
certain powers of $D_\phi$ are metrics on $\mathscr{P}_\lambda$. Therefore, one could
investigate conditions on $\mathcal{F}$ for which $\gamma_\mathcal{F}$ equals such functions
of $D_\phi$.



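Appendix A
Proof of Lemma 3

($\Leftarrow$) If $\phi$ is of the form in (9), then by Lemma 4, we
have $D_\phi(P, Q) = \frac{\beta - \alpha}{2} \int_M |p - q|\, d\lambda$, which is a metric on
$\mathscr{P}_\lambda$ if $\beta > \alpha$ and therefore is a pseudometric on $\mathscr{P}_\lambda$. If
$\beta = \alpha$, $D_\phi(P, Q) = 0$ for all $P, Q \in \mathscr{P}_\lambda$ and therefore is a
pseudometric on $\mathscr{P}_\lambda$.

($\Rightarrow$) If $D_\phi$ is a pseudometric on $\mathscr{P}_\lambda$, then it satisfies the
triangle inequality and ($P = Q \Rightarrow D_\phi(P, Q) = 0$) and
therefore by [44, Theorem 2], $\phi$ is of the form in (9).

Appendix B
Proof of Theorem 8

Consider

$$\begin{aligned}
|\gamma_\mathcal{F}(P_m, Q_n) - \gamma_\mathcal{F}(P, Q)| &= \Big| \sup_{f \in \mathcal{F}} |P_m f - Q_n f| - \sup_{f \in \mathcal{F}} |Pf - Qf| \Big| \\
&\leq \sup_{f \in \mathcal{F}} \big| |P_m f - Q_n f| - |Pf - Qf| \big| \\
&\leq \sup_{f \in \mathcal{F}} |P_m f - Q_n f - Pf + Qf| \\
&\leq \sup_{f \in \mathcal{F}} \big[ |P_m f - Pf| + |Q_n f - Qf| \big] \\
&\leq \sup_{f \in \mathcal{F}} |P_m f - Pf| + \sup_{f \in \mathcal{F}} |Q_n f - Qf|.
\end{aligned}$$

Therefore, by Theorem 21 (see Appendix E),
$\sup_{f \in \mathcal{F}} |P_m f - Pf| \stackrel{a.s.}{\longrightarrow} 0$, $\sup_{f \in \mathcal{F}} |Q_n f - Qf| \stackrel{a.s.}{\longrightarrow} 0$
and the result follows.

Appendix C
Proof of Theorem 11

From the proof of Theorem 8, we have $|\gamma_\mathcal{F}(P_m, Q_n) -
\gamma_\mathcal{F}(P, Q)| \leq \sup_{f \in \mathcal{F}} |P_m f - Pf| + \sup_{f \in \mathcal{F}} |Q_n f - Qf|$. We
now bound the terms $\sup_{f \in \mathcal{F}} |P_m f - Pf|$ and $\sup_{f \in \mathcal{F}} |Q_n f -
Qf|$, which are the fundamental quantities that appear in
empirical process theory. The proof strategy begins in a
manner similar to [51, Appendix A.2], but with an additional
step which will be flagged below.

Note that $\sup_{f \in \mathcal{F}} |P_m f - Pf|$ satisfies (69) (see
Appendix E) with $c_i = \frac{2\nu}{m}$. Therefore, by McDiarmid's inequality
in (70) (see Appendix E), we have that with probability at least
$1 - \frac{\delta}{4}$, the following holds:

$$\begin{aligned}
\sup_{f \in \mathcal{F}} |P_m f - Pf| &\leq \mathbb{E} \sup_{f \in \mathcal{F}} |P_m f - Pf| + \sqrt{\frac{2\nu^2}{m} \log\frac{4}{\delta}} \\
&\stackrel{(a)}{\leq} 2\, \mathbb{E} \sup_{f \in \mathcal{F}} \Big| \frac{1}{m} \sum_{i=1}^m \sigma_i f(X^{(1)}_i) \Big| + \sqrt{\frac{2\nu^2}{m} \log\frac{4}{\delta}},
\end{aligned} \tag{63}$$

where (a) follows from bounding $\mathbb{E} \sup_{f \in \mathcal{F}} |P_m f - Pf|$
by using the symmetrization inequality in (71) (see
Appendix E). Note that the expectation in the second
line of (63) is taken jointly over $\{\sigma_i\}_{i=1}^m$ and
$\{X^{(1)}_i\}_{i=1}^m$. $\mathbb{E} \sup_{f \in \mathcal{F}} |\frac{1}{m} \sum_{i=1}^m \sigma_i f(X^{(1)}_i)|$ can be written
as $\mathbb{E}\, \mathbb{E}_\sigma \sup_{f \in \mathcal{F}} |\frac{1}{m} \sum_{i=1}^m \sigma_i f(X^{(1)}_i)|$, where the inner
expectation, which we denote as $\mathbb{E}_\sigma$, is taken with
respect to $\{\sigma_i\}_{i=1}^m$ conditioned on $\{X^{(1)}_i\}_{i=1}^m$ and the
outer expectation is taken with respect to $\{X^{(1)}_i\}_{i=1}^m$.
Since $\mathbb{E}_\sigma \sup_{f \in \mathcal{F}} |\frac{1}{m} \sum_{i=1}^m \sigma_i f(X^{(1)}_i)|$ satisfies (69) (see
Appendix E) with $c_i = \frac{2\nu}{m}$, by McDiarmid's inequality in (70)
(see Appendix E), with probability at least $1 - \frac{\delta}{4}$, we have

$$\mathbb{E} \sup_{f \in \mathcal{F}} \Big| \frac{1}{m} \sum_{i=1}^m \sigma_i f(X^{(1)}_i) \Big| \leq \mathbb{E}_\sigma \sup_{f \in \mathcal{F}} \Big| \frac{1}{m} \sum_{i=1}^m \sigma_i f(X^{(1)}_i) \Big| + \sqrt{\frac{2\nu^2}{m} \log\frac{4}{\delta}}. \tag{64}$$

Tying (63) and (64), we have that with probability at least
$1 - \frac{\delta}{2}$, the following holds:

$$\sup_{f \in \mathcal{F}} |P_m f - Pf| \leq 2\, \mathbb{E}_\sigma \sup_{f \in \mathcal{F}} \Big| \frac{1}{m} \sum_{i=1}^m \sigma_i f(X^{(1)}_i) \Big| + \sqrt{\frac{18\nu^2}{m} \log\frac{4}{\delta}}. \tag{65}$$

Performing similar analysis for $\sup_{f \in \mathcal{F}} |Q_n f - Qf|$, we have
that with probability at least $1 - \frac{\delta}{2}$,

$$\sup_{f \in \mathcal{F}} |Q_n f - Qf| \leq 2\, \mathbb{E}_\sigma \sup_{f \in \mathcal{F}} \Big| \frac{1}{n} \sum_{i=1}^n \sigma_i f(X^{(2)}_i) \Big| + \sqrt{\frac{18\nu^2}{n} \log\frac{4}{\delta}}. \tag{66}$$

The result follows by adding (65) and (66). Note that the
second application of McDiarmid was not needed in [51,
Appendix A.2], since in that case a simplification was possible
due to $\mathcal{F}$ being restricted to RKHSs.

Appendix D
Proof of Lemma 15

Note that $A := \{x : \psi(x) \leq b\}$ is a convex subset of $V$.
Since $\theta$ is not constant on $A$, by Theorem 24 (see Appendix E),
$\theta$ attains its supremum on the boundary of $A$. Therefore, any
solution, $x_*$, to (48) satisfies $\theta(x_*) = a$ and $\psi(x_*) = b$. Let
$G := \{x : \theta(x) > a\}$. For any $x \in G$, $\psi(x) > b$. If this were
not the case, then $x_*$ is not a solution to (48). Let $H := \{x :
\theta(x) = a\}$. Clearly, $x_* \in H$ and so there exists an $x \in H$ for
which $\psi(x) = b$. Suppose $\inf\{\psi(x) : x \in H\} = c < b$, which
means for some $x^* \in H$, $x^* \in A$. From (48), this implies $\theta$
attains its supremum relative to $A$ at some point of relative
interior of $A$. By Theorem 24, this implies $\theta$ is constant on $A$,
leading to a contradiction. Therefore, $\inf\{\psi(x) : x \in H\} = b$
and the result in (49) follows.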



Appendix E 
Supplementary Results 

In this section, we collect results that are used to prove
results in Section III.

We quote the following result on Lipschitz extensions from 
[41] (see also [66], [67]). 

Lemma 19 (Lipschitz extension): Given a function $f$
defined on a finite subset $x_1, \ldots, x_n$ of $M$, there exists a function
$\tilde{f}$ which coincides with $f$ on $x_1, \ldots, x_n$, is defined on the
whole space $M$, and has the same Lipschitz constant as $f$.
Additionally, it is possible to explicitly construct $\tilde{f}$ in the form

$$\tilde{f}(x) = \alpha \min_{i = 1, \ldots, n} \big( f(x_i) + L(f) \rho(x, x_i) \big) + (1 - \alpha) \max_{i = 1, \ldots, n} \big( f(x_i) - L(f) \rho(x, x_i) \big), \tag{67}$$

for any $\alpha \in [0, 1]$, with $L(f) = \max_{x_i \neq x_j} \frac{|f(x_i) - f(x_j)|}{\rho(x_i, x_j)}$.
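
The construction in (67) is easy to implement. The following Python sketch is an illustration only, assuming $M = \mathbb{R}$ with $\rho(x, y) = |x - y|$ and a small hypothetical set of points and values; the function names are not from the paper.

```python
def lipschitz_constant(xs, fs):
    """L(f): largest slope between any two distinct sample points."""
    return max(abs(fi - fj) / abs(xi - xj)
               for xi, fi in zip(xs, fs)
               for xj, fj in zip(xs, fs) if xi != xj)


def lipschitz_extension(xs, fs, alpha=0.5):
    """Build the extension in (67) from values fs given on the points xs."""
    lip = lipschitz_constant(xs, fs)

    def f_tilde(x):
        upper = min(fi + lip * abs(x - xi) for xi, fi in zip(xs, fs))
        lower = max(fi - lip * abs(x - xi) for xi, fi in zip(xs, fs))
        return alpha * upper + (1.0 - alpha) * lower

    return f_tilde


xs = [-1.0, 0.0, 2.0]   # hypothetical points on the real line
fs = [-1.0, 1.0, 1.0]   # hypothetical function values at those points

lip = lipschitz_constant(xs, fs)
f_tilde = lipschitz_extension(xs, fs, alpha=0.3)

# Largest slope of the extension over a grid covering [-3, 4].
grid = [-3.0 + 0.05 * t for t in range(141)]
max_slope = max(abs(f_tilde(a) - f_tilde(b)) / abs(a - b)
                for a in grid for b in grid if a != b)
```

Both the upper envelope (the minimum) and the lower envelope (the maximum) interpolate the data with Lipschitz constant $L(f)$, so any convex combination of the two does as well.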

The following result on bounded Lipschitz extensions is 
quoted from [9, Proposition 11.2.3]. 

Lemma 20 (Bounded Lipschitz extension): If $A \subset M$ and
$f \in BL(A, \rho)$, then $f$ can be extended to a function
$h \in BL(M, \rho)$ with $h = f$ on $A$ and $\|h\|_{BL} = \|f\|_{BL}$.
Additionally, it is possible to explicitly construct $h$ as

$$h = \max\big( -\|f\|_\infty, \min(g, \|f\|_\infty) \big), \tag{68}$$

where $g$ is a function on $M$ such that $g = f$ on $A$ and
$\|g\|_L = \|f\|_L$.

The following result is quoted from [52, Theorem 3.7]. 

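Theorem 21: Let $F(x) = \sup_{f \in \mathcal{F}} |f(x)|$ be the envelope
function for $\mathcal{F}$. Assume that $\int F\, dP < \infty$, and suppose
moreover that for any $\varepsilon > 0$, $\frac{1}{m} \mathcal{H}(\varepsilon, \mathcal{F}, L_1(P_m)) \stackrel{P}{\longrightarrow} 0$. Then
$\sup_{f \in \mathcal{F}} |P_m f - Pf| \stackrel{a.s.}{\longrightarrow} 0$.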

Theorem 22 ([68] McDiarmid's Inequality): Let $X_1, \ldots,
X_n, X'_1, \ldots, X'_n$ be independent random variables taking
values in a set $M$, and assume that $f : M^n \to \mathbb{R}$ satisfies

$$|f(x_1, \ldots, x_n) - f(x_1, \ldots, x_{i-1}, x'_i, x_{i+1}, \ldots, x_n)| \leq c_i, \tag{69}$$

$\forall x_1, \ldots, x_n, x'_1, \ldots, x'_n \in M$. Then for every $\epsilon > 0$,

$$\Pr\big( f(X_1, \ldots, X_n) - \mathbb{E} f(X_1, \ldots, X_n) \geq \epsilon \big) \leq \exp\left( -\frac{2\epsilon^2}{\sum_{i=1}^n c_i^2} \right). \tag{70}$$
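
To see how (70) turns into a high-probability deviation term, one can set the right-hand side equal to a target failure probability and solve for the deviation. The following Python sketch does this; the function names and the numerical values are hypothetical, and the choice $c_i = 2\nu/m$ is an assumption corresponding to the supremum of an empirical process over a function class uniformly bounded by $\nu$.

```python
import math


def mcdiarmid_tail(eps, cs):
    """Right-hand side of (70): exp(-2 eps^2 / sum_i c_i^2)."""
    return math.exp(-2.0 * eps ** 2 / sum(c * c for c in cs))


def mcdiarmid_deviation(delta, cs):
    """Deviation eps at which the bound in (70) equals delta."""
    return math.sqrt(0.5 * sum(c * c for c in cs) * math.log(1.0 / delta))


nu, m, delta = 1.0, 200, 0.05   # hypothetical bound, sample size, confidence level
cs = [2.0 * nu / m] * m         # assumed bounded differences c_i = 2 nu / m

# Deviation that holds with probability at least 1 - delta / 4.
eps = mcdiarmid_deviation(delta / 4.0, cs)
closed_form = math.sqrt(2.0 * nu ** 2 / m * math.log(4.0 / delta))
```

With these bounded differences the deviation simplifies to $\sqrt{\frac{2\nu^2}{m} \log\frac{4}{\delta}}$, and three such terms combine into $\sqrt{\frac{18\nu^2}{m} \log\frac{4}{\delta}}$.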

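Lemma 23 ([20] Symmetrization): Let $\sigma_1, \ldots, \sigma_N$ be i.i.d.
Rademacher random variables. Then,

$$\mathbb{E} \sup_{f \in \mathcal{F}} \Big| \mathbb{E} f - \frac{1}{N} \sum_{i=1}^N f(x_i) \Big| \leq 2\, \mathbb{E} \sup_{f \in \mathcal{F}} \Big| \frac{1}{N} \sum_{i=1}^N \sigma_i f(x_i) \Big|. \tag{71}$$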

The following result is quoted from [69, Theorem 32.1]. 

Theorem 24: Let $f$ be a convex function, and let $C$ be
a convex set contained in the domain of $f$. If $f$ attains its
supremum relative to $C$ at some point of relative interior of
$C$, then $f$ is actually constant throughout $C$.